Preface:

The econometric discipline has been criticized for being too similar to mathematical statistics and only to
a limited degree linked to formalized theoretical models. This is particularly the case as regards the
formulation and specification of the stochastic elements in econometric models. Ragnar Frisch, who is
known as the originator of econometrics, expressed both in theory and practice an opposite ideal;
namely, econometrics as an almost symbiotic blend of statistical methodology and mathematically
formulated theory, cf. Frisch (1926). See also Bjerkholt (1995).

Theory and econometric methodology for qualitative choice behavior is developed in a
tradition which I believe is somewhat closer to the ideal of Frisch than much of the traditional textbook
approach to econometrics. This stems from the fact that the theory of qualitative choice is rooted in a
tradition where probabilistic concepts and formulations play a key role, in contrast to the point of
departure in traditional micro theory, which is deterministic. Since probabilistic concepts are integral
parts of the theory of qualitative choice, the gap between theory and empirical model
specification in applications often becomes less wide than is the case in the traditional micro-economic
approach.

The present compendium is a revised version of an introductory course in the theory of
qualitative choice behavior (often called the theory of discrete choice). Some of the material I present
here draws on a Ph.D. course I gave at the Department of Economics, University of Wisconsin, during
the Fall semester of 1990.


Acknowledgement: I acknowledge the helpful comments of Steinar Strøm and thank Anne Skoglund for
word processing assistance.
1. Introduction

The traditional theory for individual choice behavior, as it is usually presented in textbooks of
consumer theory, presupposes that the goods offered in the market are infinitely divisible. However,
many important economic decisions involve choice among qualitative, or discrete, alternatives.
Examples are choice among transportation alternatives, labor force participation, family size,
residential location, type and level of education, brand of automobile, etc. In transportation analyses,
for example, one is typically interested in estimating price and income elasticities to evaluate the
effect of changes in alternative-specific attributes such as fuel prices and user-cost for automobiles.
In addition, it is of interest to be able to predict the changes in the aggregate distribution of
commuters that follow from introducing a new transportation alternative, or closing down an old one.

The set of alternatives may be "structurally" discrete or only "observationally" discrete. The
set of feasible transportation alternatives is an example of a structurally categorical setting, while
different levels of labor supply such as "part time" and "full time" employment may be interpreted as
only observationally discrete, since the underlying set of feasible alternatives, "hours of work", is a
continuum.

In several applications the interest is to model choice behavior for so-called 
discrete/continuous settings. Typical examples of phenomena where the response is 
discrete/continuous are variants of consumer demand models with corner solutions. Here the discrete 
choice consists in whether or not to purchase a positive quantity of a specific commodity, and the 
continuous choice is how much to purchase, given that the discrete decision is to purchase a positive 
amount. Another type of application is the demand for durables combined with the intensity of use. 
For example, a consumer that purchases an automobile has preferences over the intensity of use, and a 
household that purchases an electric appliance is also concerned with the intensity of use of the 
equipment. 

The recent theory of probabilistic, or discrete/continuous, choice is designed to model these
kinds of choice settings, and to provide the corresponding econometric methodology for empirical
analyses. Due to variables that are unobservable to the econometrician (and possibly also to the
individual agents themselves), the observations from a sample of agents' discrete choices can be
viewed as outcomes generated by a stochastic model. Statistically, these observations can be
considered as outcomes of multinomial experiments, since the alternatives typically are mutually
exclusive. In the context of choice behavior, the probabilities in the multinomial model are to be
interpreted as the probabilities of choosing the respective alternatives (choice probabilities), and the
purpose of the theory of discrete choice is to provide a structure of the probabilities that can be
justified from behavioral arguments. Specifically, one is, analogously to the standard textbook theory
of consumer behavior, interested in expressing the choice probabilities as functions of the agents'
preferences and the choice constraints. The choice constraints are represented by the usual economic
budget constraint and, in addition, the choice set (possibly individual specific), which is the set of
alternatives that are feasible to the agent. For example, in transportation modelling some commuters
may have access to railway transportation while others may not.

In the last 25 years there has been an almost explosive development in the theoretical and
methodological literature within the field of discrete choice. Originally, much of the theory was
developed by psychologists, and it was not until the mid-sixties that economists started to adopt and
adjust the theory with the purpose of analyzing discrete choice problems. In the present compendium
we shall discuss central parts of the theory of discrete/continuous choice as well as some of the
econometric methods that apply.

In contrast to standard textbooks and surveys in econometric modelling of discrete choice
such as Maddala (1983), Train (1986), Amemiya (1981), McFadden (1984) and Ben-Akiva and
Lerman (1985), the focus of the present treatment is more on the theoretical developments than on
statistical methodology. The reason for this is two-fold. First, it is believed that it is of substantial
interest to bring forward some of the recent theoretical results that otherwise would not be easily
accessible for the non-expert student. Second, the statistical methodology for estimation, testing and
diagnostic analysis is rather well covered by the textbooks and surveys mentioned above.¹

This survey is organized as follows: In chapter 2 I give a brief overview of reduced form type
specifications and estimation of models with discrete or limited dependent response. In chapter 3 I
discuss some important elements of probabilistic choice theory, and in chapter 4 the issue is
functional forms and econometric specification of discrete choice models. In chapter 5 I discuss the
modeling of a few selected applications of discrete choice analysis, and in chapter 6 the extension to
discrete/continuous choice models is treated. In the final chapter I discuss applications of
discrete/continuous modeling.


¹ An elementary survey in Norwegian is Dagsvik (1985).


2. Statistical analysis when the dependent variable is discrete 

As mentioned in the introduction, there are many interesting phenomena which naturally can be
modelled with a dependent variable being qualitative (discrete) or where the dependent variables may
be both discrete and continuous.

While most of the subsequent chapters will discuss theoretical aspects of discrete/continuous
choice, we shall in this chapter give a brief summary of the most common statistical models and tools
which are useful for analyzing such phenomena, without assuming that the underlying response
variables necessarily are generated by agents that make decisions. A more detailed exposition is found
in Maddala (1983), chapters one and two. However, the statistical methodology we discuss is of
relevance for estimating the choice models for agents (consumers, firms, workers, etc.), and will be
further discussed in subsequent chapters.


2.1. Models with discrete response 




When analyzing "demand for housing", "tourist destinations", "type of accident", etc., the
response (or dependent) variable is typically discrete, and it often has the structure of a binomial or,
more generally, a multinomial variable. Recall that in multinomial experiments with m possible
categories only one out of m outcomes can occur in each experiment. In other words, the outcomes are
mutually exclusive. For example, out of m possible housing alternatives the household will only select
one. Similarly, a student who has the choice between m different schools will only select one.
Statistically, a multinomial model is represented by probabilities, $P_j$, $j = 1, 2, \ldots, m$, where $P_j$ is the
probability that outcome j occurs.


Let $Y_j$ denote the corresponding response variable, where $Y_j = 1$ if outcome j occurs and zero
otherwise. (For simplicity, we suppress the indexation of the agent.) Then
$E Y_j = P(Y_j = 1) \cdot 1 + P(Y_j = 0) \cdot 0 = P(Y_j = 1) = P_j$. We can therefore write

(2.1)  $Y_j = P_j + \varepsilon_j$

where $\{\varepsilon_j\}$ are random terms with zero mean. Thus, once the systematic term $P_j$ has been specified as
a function of explanatory variables, one could estimate the unknown parameters by regression
analysis. However, it is problematic to specify the probabilities $\{P_j\}$ as linear functions of the
explanatory variables, because a linear specification does not necessarily satisfy the
constraints $0 \le P_j \le 1$ and $\sum_j P_j = 1$ (cf. Maddala, 1983, pp. 15-16, or Greene, 1990, pp. 636-
441).
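The difficulty with a linear specification can be seen in a small numerical sketch. The data below are hypothetical and the helper `ols_line` is an illustrative name, not something taken from the text; the point is only that a least-squares line fitted to a 0/1 response can produce fitted "probabilities" outside the unit interval.

```python
def ols_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: x is an explanatory variable, y is a 0/1 response.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

intercept, slope = ols_line(xs, ys)
fitted = [intercept + slope * x for x in xs]

# Fitted values that cannot be interpreted as probabilities.
out_of_range = [p for p in fitted if p < 0.0 or p > 1.0]
print(intercept, slope, out_of_range)
```

In this made-up sample the fitted values at the smallest and largest x fall below zero and above one, which is the kind of violation of the constraints on $P_j$ that the text refers to.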


Example 2.1 

Consider the modelling of labor force participation. In this case m = 2, where alternative one
represents participation, while alternative two represents nonparticipation. It is believed that a number
of factors, such as age, marital status, number of small children, education, etc., explain the outcome.
Let X be the vector of relevant (observable) variables that explain the outcome. Thus

(2.2)  $P_1 = \psi(X\beta)$

where $\psi(\cdot)$ is a suitably chosen functional form while $\beta$ is a vector of unknown parameters. If one
could estimate $\beta$ it would, for example, be possible to assess the marginal effect of education on
labor force participation. We realize that $\psi(\cdot)$ must satisfy $0 < \psi(\cdot) < 1$.


2.1.1. The multinomial Logit model 
One convenient and commonly used specification that fulfills the above restrictions is the multinomial
logit model. One version of the multinomial logit model has the structure

(2.3)  $P_j = H_j(X) = \frac{\exp(X\beta_j)}{\sum_{k=1}^{m} \exp(X\beta_k)}$

where X is, typically, a vector of agent-specific variables and $\beta_j$, $j = 1, 2, \ldots, m$, are vectors of
unknown parameters. This specification is also convenient for estimation purposes, as we shall see
below.


From (2.3) it follows that

(2.4)  $\log\left(\frac{H_j(X)}{H_1(X)}\right) = X(\beta_j - \beta_1)$.

Eq. (2.4) demonstrates that at most $\beta_j - \beta_1$ can be identified. To realize this, suppose
$\beta_j^*$, $j = 1, 2, \ldots, m$, are parameter vectors such that $\beta_j^* \ne \beta_j$. If

$\beta_j^* = \beta_j - \beta_1 + \beta_1^*$

for $j = 2, \ldots, m$, then $\{\beta_j^*\}$ will satisfy (2.4), and consequently $\{\beta_j\}$ are not identified. We can
therefore, without loss of generality, put $\beta_1 = 0$, and write

(2.5a)  $H_1(X) = \frac{1}{1 + \sum_{k=2}^{m} \exp(X\beta_k)}$

and

(2.5b)  $H_j(X) = \frac{\exp(X\beta_j)}{1 + \sum_{k=2}^{m} \exp(X\beta_k)}$

for $j = 2, 3, \ldots, m$.
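The identification argument can be checked numerically. The following is a minimal sketch, with hypothetical covariates and parameter vectors, of the multinomial logit probabilities in (2.3); it shows that adding the same vector to every $\beta_j$, or normalizing $\beta_1 = 0$, leaves the probabilities unchanged. The function name `logit_probs` is an illustrative choice.

```python
import math


def logit_probs(x, betas):
    """Multinomial logit probabilities as in (2.3) for one agent.

    x is a list of covariates, betas a list of m parameter vectors.
    """
    scores = [sum(xi * bi for xi, bi in zip(x, beta)) for beta in betas]
    top = max(scores)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


x = [1.0, 0.5]                                  # hypothetical covariates
betas = [[0.2, -1.0], [0.7, 0.3], [-0.4, 1.1]]  # hypothetical parameters

# Add the same vector to every beta_j: only differences matter.
shift = [3.0, -2.0]
shifted = [[b + s for b, s in zip(beta, shift)] for beta in betas]

# Normalization beta_1 = 0, corresponding to (2.5a) and (2.5b).
normalized = [[b - b1 for b, b1 in zip(beta, betas[0])] for beta in betas]

p = logit_probs(x, betas)
p_shifted = logit_probs(x, shifted)
p_normalized = logit_probs(x, normalized)
print(p, p_shifted, p_normalized)
```

All three calls return the same probabilities up to rounding, which is why only the differences $\beta_j - \beta_1$ can be estimated.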


Example 2.2 

Consider the choice of tourist destination. Suppose there are m relevant destinations. We
assume that the relevant variables that influence this choice are age, income, education, marital status,
family size, etc. Let X be the vector of these variables. The probability of choosing destination j can
now be modelled as in (2.5).


2.1.2. The binary Probit model 

The binary probit model is often motivated by a latent variable specification such as in (2.7), but with
u normally distributed instead of logistically distributed. If $\Phi(\cdot)$ denotes the cumulative distribution
function of the standard normal distribution, N(0,1), then the probit model follows by replacing the
logistic distribution function L(y) by $\Phi(y)$, in which case we obtain the binary Probit model as

(2.6)  $\Phi(X\beta) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{X\beta} e^{-x^2/2} \, dx$.

The normal and the logistic distributions are rather close, and in most applications one has found that
the binary logit and probit models are almost indistinguishable.

In case there are extreme values of the explanatory variables, the predictions from the logit and
probit models conditional on these extreme values may, however, differ, since the logistic distribution
has slightly heavier tails than the normal distribution.
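The closeness of the two distributions, and the difference in the tails, can be illustrated with a short sketch. The rescaling constant 1.702 is an assumption brought in for the comparison (it is a commonly used value for matching the logistic to the standard normal) and does not come from the text; the evaluation grid is arbitrary.

```python
import math


def normal_cdf(y):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-y / math.sqrt(2.0))


def logistic_cdf(y):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-y))


SCALE = 1.702  # assumed rescaling of the logistic argument

# Largest gap between the two distribution functions on a grid.
grid = [i / 100.0 for i in range(-600, 601)]
max_gap = max(abs(normal_cdf(y) - logistic_cdf(SCALE * y)) for y in grid)

# Ratio of the two left-tail probabilities far from the centre.
tail_ratio = logistic_cdf(SCALE * -4.0) / normal_cdf(-4.0)
print(max_gap, tail_ratio)
```

After rescaling, the two curves differ by less than 0.01 everywhere on the grid, while the logistic tail probability at the extreme point is many times larger than the normal one. This is why predictions may diverge only at extreme values of the explanatory variables.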


2.1.3. Binary models derived from latent variable specifications 
For the sake of motivation let us reconsider Example 2.1. Let now $U_j$ be the individual's utility of
alternative j, j = 1, 2, and let

(2.7)  $U_j = X\beta_j + u_j$

where $u_j$ is a random variable that is supposed to capture unobserved variables that affect the utility of
alternative j. Let

(2.8)  $Y^* = U_1 - U_2 = X\beta - u$

where $\beta = \beta_1 - \beta_2$ and $u = u_2 - u_1$. Let $\psi(y) = P(u \le y)$ be the cumulative distribution function of
u, which we assume is independent of X. Consistent with the notation in Example 2.1, let the
observable variable, $Y_1$, be given by

$Y_1 = 1$ if $Y^* > 0$, and $Y_1 = 0$ otherwise,

and $Y_2 = 1 - Y_1$. From (2.8) it follows that the probability of participation equals

$P_1 = P(Y_1 = 1) = P(Y^* > 0) = P(X\beta - u > 0) = P(X\beta > u) = \psi(X\beta)$.

If $\psi(y) = \Phi(y)$, where $\Phi(\cdot)$ is given by (2.6), the Probit model follows, whereas if

(2.9)  $\psi(y) = \frac{1}{1 + \exp(-y)}$

the binary Logit model follows. The distribution function (2.9) is known as the Logistic distribution.
For example, in the labor force participation example, $Y^*$ may be interpreted as the difference
between the agent's (expected) market wage and the reservation wage. This and further examples will
be discussed in chapter 5.
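The latent variable argument can be illustrated by simulation. The sketch below assumes a hypothetical value of the index $X\beta$ and draws u from the logistic distribution (2.9) by inverting its distribution function; the function name, seed and sample size are illustrative choices.

```python
import math
import random


def logistic_cdf(y):
    """Logistic distribution function, as in (2.9)."""
    return 1.0 / (1.0 + math.exp(-y))


def simulate_participation(xb, n_draws, rng):
    """Share of draws with Y* = xb - u > 0 when u is standard logistic."""
    count = 0
    for _ in range(n_draws):
        p = rng.random()
        while p == 0.0:  # avoid log(0); random() never returns 1.0
            p = rng.random()
        u = math.log(p / (1.0 - p))  # inverse of the logistic cdf
        if xb - u > 0:
            count += 1
    return count / n_draws


rng = random.Random(12345)
xb = 0.8  # hypothetical value of the index X*beta
share = simulate_participation(xb, 200_000, rng)
theory = logistic_cdf(xb)
print(share, theory)
```

The simulated share of "participants" should be close to $\psi(X\beta)$, which is the binary logit probability implied by the latent variable specification.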
3. Theoretical developments of probabilistic choice models


3.1. Random utility models 

As indicated above, the basic problem confronted by discrete choice theory is the modelling of choice
from a set of mutually exclusive and collectively exhaustive alternatives. In principle, one could apply
the conventional microeconomic approach for divisible commodities to model these phenomena, but a
moment's reflection reveals that this would be rather awkward. This is due to the fact that when the
alternatives are discrete, it is not possible to base the modelling of the agent's chosen quantities on
evaluating marginal rates of substitution (marginal calculus), simply because the utility function will
not be differentiable. In other words, the standard marginal calculus approach does not work in this
case. Consequently, discrete choice analysis calls for a different approach.


3.1.1. The Thurstone model
Historically, discrete choice analysis was initiated by psychologists. Thurstone (1927) proposed the
Thurstone model to explain the results from psychological and psychophysical experiments. These
experiments involved asking students to compare intensities of physical stimuli. For example, a
student could be asked to rank objects in terms of weights, or tones in terms of loudness. The data
from these experiments revealed that some students would make
different rankings when the choice experiments were replicated. To account for the variability in
responses, Thurstone proposed a model based on the idea that a stimulus induces a "psychological
state" that is a realization of a random variable. Specifically, he represented the preferences over the
alternatives by random variables, so that the individual decision-maker would choose the alternative
with the highest value of the random variable. The interpretation is two-fold: First, the utilities may
vary across individuals due to variables that are not observable to the analyst. Second, the utility of a
given alternative may also vary from one moment to the next, for the same individual, due to
fluctuations in the individual's psychological state. As a result, the observed decisions may vary
across identical experiments even for the same individual.


In many experiments Thurstone asked each individual to make several binary comparisons,
and he represented the utility of each alternative by a normally distributed random variable. Let $U_1^i$
and $U_2^i$ denote the utilities a specific individual associates with the alternatives in replication no. i,
$i = 1, 2, \ldots, m$. Thurstone assumed that

$U_j^i = v_j + \varepsilon_j^i$

where $\varepsilon_j^i$, $j = 1, 2$, $i = 1, 2, \ldots, m$, are independent and normally distributed with zero mean and standard
deviation equal to $\sigma_j$. Thus, according to the decision rule, the individual would choose alternative one
in replication i if $U_1^i$ is greater than $U_2^i$. Due to the "error term", $\varepsilon_j^i$, the individual may make
different judgments in replications of the same experiment. Let $Y_j^i = 1$ if alternative j is chosen in
replication i and zero otherwise. The relative number of times the individual chooses alternative j,
$\hat{P}_j$, equals

$\hat{P}_j = \frac{1}{m} \sum_{i=1}^{m} Y_j^i$,

$j = 1, 2$. When the number of replications increases, it follows from the law of large numbers that
$\hat{P}_j$ tends towards the theoretical probability, which for alternative one is

(3.1)  $P_1 = P(U_1^i > U_2^i) = \Phi\left(\frac{v_1 - v_2}{\sqrt{\sigma_1^2 + \sigma_2^2}}\right)$

where $\Phi(y)$ is the standard cumulative normal distribution. The last equality in (3.1) follows from the
assumption that the error terms are normally distributed random variables. The probability in (3.1)
represents the propensity of choosing alternative one, and it is a function of the standard deviations and
the means, $v_j$. While $v_j$ represents the "average" utility of alternative j, the respective standard
deviations account for the degree of instability in the individual's preferences across replicated
experiments. We recognize (3.1) as a version of the binary probit model.
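The convergence of the relative frequency towards (3.1) can be sketched by simulating replicated binary comparisons. The means, standard deviations, seed and number of replications below are hypothetical values chosen for illustration only.

```python
import math
import random


def normal_cdf(y):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-y / math.sqrt(2.0))


def thurstone_share(v1, v2, s1, s2, n_rep, rng):
    """Relative number of replications in which alternative one is chosen."""
    wins = 0
    for _ in range(n_rep):
        u1 = v1 + rng.gauss(0.0, s1)  # utility of alternative one
        u2 = v2 + rng.gauss(0.0, s2)  # utility of alternative two
        if u1 > u2:
            wins += 1
    return wins / n_rep


rng = random.Random(2024)
v1, v2, s1, s2 = 1.0, 0.4, 0.8, 0.6  # hypothetical means and std. deviations
p_hat = thurstone_share(v1, v2, s1, s2, 100_000, rng)
p_theory = normal_cdf((v1 - v2) / math.sqrt(s1 ** 2 + s2 ** 2))
print(p_hat, p_theory)
```

With many replications the simulated share should lie close to the probit probability in (3.1), in line with the law of large numbers argument above.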

Although Thurstone suggested that the above approach could be extended to the multinomial
choice setting, and with other distribution functions than the normal one, the statistical theory at that
time was not sufficiently developed to make such extensions practical.


3.1.2. The neoclassicist's approach

The tradition in economics is somewhat different from the psychologist's approach. Specifically, the
econometrician usually is concerned with analyzing discrete data obtained from a sample of
individuals. With a neoclassical point of departure, the preferences are typically assumed to be
deterministic from the agent's point of view, in the sense that if the experiment were replicated, the
agent would make identical decisions. In practice, however, one may observe that observationally
identical agents make different choices. This is explained as resulting from variables that affect the
choice process and are unobservable to the econometrician. The unobservables are, however, assumed
to be perfectly known to the individual agents. Consequently, the utility function is modeled as
random from the observing econometrician's point of view, while it is interpreted as deterministic to
the agent himself. Thus the randomness is due to the lack of information available to the observer.
In contrast to the psychologist, the neoclassical economist usually seems reluctant to interpret
the random variables in the utility function as random to the agent himself. Since the economist often
does not have access to data from replicated experiments, he is not readily forced to modify his point
of view either. There are, however, exceptions; see for example Quandt (1956).


3.1.3. General systems of choice probabilities 


Formally, we shall describe a system of choice probabilities as follows:


Definition 1; System of choice probabilities
(i) A universe of choice alternatives, S. Each alternative in S may be characterized by a set of
variables which we shall call attributes.
(ii) A set of agent-specific characteristics.
(iii) A random utility function $U_j$, where $U_j$ is the agent's utility of alternative j, $j \in S$, and a
distribution function M which yields the joint distribution of the utilities in S, i.e.,

(3.2)  $M(u_1, u_2, \ldots) = P(U_1 \le u_1, U_2 \le u_2, \ldots)$.


From the assumptions above it is possible in principle to derive the system of choice
probabilities, $\{P_j(B)\}$, where $P_j(B)$ is defined by

(3.3)  $P_j(B) = P\left(U_j = \max_{k \in B} U_k\right)$

and $j \in B \subseteq S$. The interpretation of (3.3) is as the probability that the agent will choose alternative j
when the set of feasible alternatives is equal to B. It is important to stress that a choice probability is
a function of two arguments, namely j and B. For each given B, $P_j(B)$, $j \in B$, are multinomial
probabilities. The relationship between $P_j(B)$ and $P_j(A)$ for two different choice sets A and B is
governed by the joint distribution of the utilities. As explained above, the empirical counterpart of
$P_j(B)$ is the fraction of individuals with observationally identical characteristics that have chosen
alternative j from B.


Often, the random utilities are assumed to have an additively separable structure,

(3.4)  $U_j = v_j + \varepsilon_j$

where $v_j$ is a deterministic term and $\varepsilon_j$ is a random variable, with the joint distribution of the terms $\{\varepsilon_j\}$
assumed to be independent of $\{v_j\}$. In empirical applications the deterministic terms are specified as
functions of observable attributes and individual characteristics.

Similarly to Manski (1977) we may identify the following sources of uncertainty that 
contribute to the randomness in the preferences: 

(i) Unobservable attributes: The vector of attributes that characterize the alternatives may only
partly be observable to the econometrician.

(ii) Unobservable individual-specific characteristics: Some of the variables that influence the
variation in the agents' tastes may be unobservable to the econometrician.

(iii) Measurement errors: There may be measurement errors in the attributes, choice sets and
individual characteristics.

(iv) Functional misspecification: The functional form of the utility function and the distribution of
the random terms are not fully known by the observer. In practice, he must specify a parametric
form of the utility function as well as the distribution function, which at best are crude
approximations to the true underlying functional forms.

(v) Bounded rationality: We may go along with the psychologist's point of view in allowing the
utilities to be random to the agent himself. In addition to the assessment made by Thurstone,
there is an increasing body of empirical evidence, as well as common daily life experience,
suggesting that agents in the decision process seem to have difficulty with assessing the precise
value of each alternative. Furthermore, their preferences may change from one moment to the
next in a manner that is unpredictable (to the agents themselves).

To summarize, it is possible to interpret the randomness of the agents' utility functions as
partly an effect of unobservable taste variation and partly an effect that stems from the agents' difficulty
in dealing with the complexity of assessing the proper value of the alternatives. In other words, it
seems plausible to interpret the utilities as random variables both to the observer and to the
agent himself. In practice, it will seldom be possible to identify the contribution from the different
sources to the uncertainty in preferences. For example, if the data at hand consist of observations
from a cross-section of consumers, we will not be able to distinguish between seemingly inconsistent
choice behavior that results from unobservables versus preferences that are uncertain to the agents
themselves.

Before we discuss the random utility approach further we shall next turn to a very important
contribution in the theory of discrete choice.
3.2. The Luce model

Luce (1959) introduced a class of probabilistic discrete choice models that has become very important
in many fields of choice analysis. Instead of Thurstone's random utility approach, Luce postulated a
structure on the choice probabilities directly, without assuming the existence of any underlying
(random) utility function. Recall that $P_j(B)$ means the probability that the agent chooses
alternative j from B when B is the choice set. Statistically, for each given B, these are the
probabilities in a multinomial model (because the choices are mutually exclusive), which
sum to one. However, the question remains how these probabilities should be specified as a
function of the attributes and how the choice probabilities should depend on the choice set; in
other words, how are $\{P_j(B)\}$ and $\{P_j(A)\}$ related when $j \in B \cap A$? To deal with this challenge, Luce
proposed his famous Choice Axiom, which has later become known as the IIA property: "Independence
from Irrelevant Alternatives". To describe IIA we think of the agent as if he organizes his decision
process in two (or several) stages: In the first stage he selects a subset A from B, where A contains
alternatives that are preferable to the alternatives in B\A. In the second stage the agent
chooses his preferred alternative from A. So far this entails no essential loss of generality, since it is
usually possible to think of the decision process in this manner. The crucial assumption Luce
made is that, on average, the choice from A in the last stage does not depend on alternatives outside
A; the alternatives discarded in the first stage have been completely "forgotten" by the agent. In other
words, the alternatives outside A are irrelevant. A probabilistic statement of this property is as
follows: Let $P_A(B)$ denote the probability of selecting a subset A from B, defined by

$P_A(B) = \sum_{j \in A} P_j(B)$.

Definition 2; Independence from irrelevant alternatives (IIA)
A system of choice probabilities, $\{P_j(B)\}$, with $P_j(B) \ne 0, 1$, satisfies IIA if and only if for all
j, A, B such that $j \in A \subseteq B \subseteq S$,

(3.5)  $P_j(B) = P_A(B) \, P_j(A)$.

Eq. (3.5) states that the probability of choosing alternative j from B equals the probability of
selecting a subset A of the "best" alternatives in stage one times the probability of selecting
alternative j from A in the second stage. Notice that the second stage probability, $P_j(A)$, has the same
structure as $P_j(B)$, i.e., it does not depend on alternatives outside the (current) choice set A. Note that
since this is a probabilistic statement it does not mean that IIA should hold in every single experiment.
It only means that it should hold on average, when the choice experiment is replicated a large number
of times, or alternatively, it should hold on average in a large sample of "identical" agents (in the
sense of agents with identically distributed tastes). We may therefore think of IIA as an assumption of
"probabilistic rationality".

It may be instructive, for the sake of clarifying the IIA property, to consider the
relationship between $P_j(B)$ and the conditional choice probability given that the chosen alternative
belongs to B. More specifically, suppose for example that the universal set S is feasible. Then the
conditional choice probability that alternative j is chosen, given that the chosen alternative belongs to
$B \subseteq S$, equals

$\frac{P_j(S)}{P_B(S)}$,

which only coincides with $P_j(B)$ when IIA holds. While $P_j(B)$ expresses the probability that j is chosen
when the choice set equals B, $P_j(S)/P_B(S)$ expresses the probability that j is chosen when the choice
set is S, given that the chosen outcome belongs to B. The empirical counterpart to $P_j(S)/P_B(S)$ is the
ratio of the number of agents that face choice set S and have chosen j to the number of agents that face
choice set S and whose choice outcomes belong to B.
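A small exact-arithmetic sketch may help to fix ideas. It assumes, purely for illustration, that choice probabilities are proportional to positive weights attached to the alternatives; the weights and the names `luce_prob` and `subset_prob` are hypothetical. Under that assumption the two-stage factorization (3.5) holds, and $P_j(B)$ coincides with the conditional probability $P_j(S)/P_B(S)$.

```python
from fractions import Fraction


def luce_prob(j, choice_set, a):
    """Probability of j from choice_set when probabilities are proportional to a."""
    return Fraction(a[j], sum(a[k] for k in choice_set))


def subset_prob(subset, choice_set, a):
    """P_A(B): probability that the choice from choice_set falls in subset."""
    return sum(luce_prob(j, choice_set, a) for j in subset)


a = {1: 4, 2: 3, 3: 2, 4: 1}  # hypothetical positive weights
S = {1, 2, 3, 4}
B = {1, 2, 3}
A = {1, 2}

lhs = luce_prob(1, B, a)                                 # P_1(B)
rhs = subset_prob(A, B, a) * luce_prob(1, A, a)          # P_A(B) * P_1(A)
conditional = luce_prob(1, S, a) / subset_prob(B, S, a)  # P_1(S) / P_B(S)
print(lhs, rhs, conditional)
```

All three quantities equal 4/9 here. For a system of choice probabilities that violates IIA, the conditional probability and $P_j(B)$ would in general differ.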


Definition 3; The Constant-Ratio Rule
A system of choice probabilities, $\{P_j(B)\}$, satisfies the constant-ratio rule if and only if for
all j, k, B such that $j, k \in B \subseteq S$,

(3.6)  $P_j(\{k, j\}) / P_k(\{k, j\}) = P_j(B) / P_k(B)$

provided the denominators do not vanish.
The following results are due to Luce (1959):


Theorem 1
Suppose $\{P_j(B)\}$ is a system of choice probabilities. Then the IIA assumption holds if and
only if there exist positive scalars, $a(j)$, $j \in S$, such that the choice probabilities equal

(3.7)  $P_j(B) = \frac{a(j)}{\sum_{k \in B} a(k)}$.

Moreover, the scalars $\{a(j)\}$ are unique apart from multiplication by a positive constant.


Proof: Assume first that (3.7) holds. Then $P_A(B) = \sum_{k \in A} a(k) / \sum_{k \in B} a(k)$ and
$P_j(A) = a(j) / \sum_{k \in A} a(k)$, so their product equals $P_j(B)$ and (3.5) holds. Assume
next that (3.5) holds. Define $a(j) = c \, P_j(S)$, where c is an arbitrary positive constant. Then by (3.5),
applied with the choice set S in place of B and the subset B in place of A, we obtain

$P_j(B) = \frac{P_j(S)}{P_B(S)} = \frac{a(j)/c}{\sum_{k \in B} a(k)/c} = \frac{a(j)}{\sum_{k \in B} a(k)}$

where $B \subseteq S$. This shows that $P_j(B)$ has the structure (3.7).

To show uniqueness (apart from multiplication by a constant), let $\hat{a}(j)$ be positive scalars
such that (3.7) holds with $a(j)$ replaced by $\hat{a}(j)$. Then with B = S we get

$\frac{P_j(S)}{P_1(S)} = \frac{a(j)}{a(1)} = \frac{\hat{a}(j)}{\hat{a}(1)}$

which implies that

$\hat{a}(j) = a(j) \cdot \frac{\hat{a}(1)}{a(1)}$.

Thus we have proved that IIA implies the existence of scalars $\{a(j), j \in S\}$ such that (3.7) holds, and
these scalars are unique apart from multiplication by a constant.
Q.E.D.
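The construction used in the proof can be mimicked in a few lines. In this sketch the probabilities on the universal set and the constants c are hypothetical; scale values are defined as c times $P_j(S)$, and the probabilities for a smaller choice set are then recovered from the structure (3.7).

```python
from fractions import Fraction


def scales_from_universal(p_universal, c):
    """Scale values a(j) = c * P_j(S), as in the proof of Theorem 1."""
    return {j: c * p for j, p in p_universal.items()}


def luce_prob(j, choice_set, a):
    """Choice probability with the structure (3.7)."""
    return a[j] / sum(a[k] for k in choice_set)


# Hypothetical choice probabilities on the universal set S = {1, 2, 3}.
p_S = {1: Fraction(1, 2), 2: Fraction(3, 10), 3: Fraction(1, 5)}

a = scales_from_universal(p_S, Fraction(10))
a_rescaled = scales_from_universal(p_S, Fraction(7, 3))

B = {2, 3}
p2_B = luce_prob(2, B, a)
p2_B_rescaled = luce_prob(2, B, a_rescaled)
p2_conditional = p_S[2] / (p_S[2] + p_S[3])  # P_2(S) / P_B(S)
print(p2_B, p2_B_rescaled, p2_conditional)
```

The arbitrary constant c cancels, so both sets of scale values give the same probability, and this probability equals $P_j(S)/P_B(S)$ as in the proof.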


Theorem 2
Let $\{P_j(B)\}$ be a system of choice probabilities. The Constant-Ratio Rule holds if and only if
IIA holds.

Proof: The constant ratio rule implies that for $j, k \in A \subseteq B \subseteq S$

$\frac{P_j(B)}{P_k(B)} = \frac{P_j(\{j, k\})}{P_k(\{j, k\})} = \frac{P_j(A)}{P_k(A)}$.

Hence, since

$P_j(B) \, P_k(A) = P_j(A) \, P_k(B)$

and

$\sum_{k \in A} P_k(A) = 1$,

we obtain

$P_j(B) = P_j(B) \sum_{k \in A} P_k(A) = P_j(A) \sum_{k \in A} P_k(B) = P_j(A) \, P_A(B)$.

Conversely, if IIA holds we realize immediately that the constant ratio rule must hold.
Q.E.D.


The results above are very powerful in that they establish statements that are equivalent to the
IIA assumption, and they yield a simple structure of the choice probabilities. For example, if the
universe S consists of four alternatives, S = {1,2,3,4}, there will be at most 11 different choice sets,
namely {1,2}, {1,3}, {2,3}, {1,4}, {2,4}, {3,4}, {1,2,3}, {1,2,4}, {1,3,4}, {2,3,4}, {1,2,3,4}. This
yields altogether 28 probabilities. Since the probabilities sum to one for each choice set we can reduce
the number of "free" probabilities to 17. However, when IIA holds we can express all the choice
probabilities by only three scale values, $a_2$, $a_3$ and $a_4$ (since we can choose $a_1 = 1$). We therefore realize
that the Luce model implies strong restrictions on the system of choice probabilities.
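The counting argument is easy to verify by enumeration. The sketch below lists all choice sets with at least two alternatives, counts the choice probabilities and the "free" probabilities, and compares these with the number of free scale values under IIA; the function name is an illustrative choice.

```python
from itertools import combinations


def count_choice_structure(universe):
    """Return (choice sets, probabilities, free probabilities, free Luce scales)."""
    sets = [
        subset
        for size in range(2, len(universe) + 1)
        for subset in combinations(universe, size)
    ]
    n_sets = len(sets)
    n_probs = sum(len(subset) for subset in sets)  # one probability per (j, B)
    n_free = n_probs - n_sets                      # each set sums to one
    n_luce = len(universe) - 1                     # a(1) normalized to one
    return n_sets, n_probs, n_free, n_luce


print(count_choice_structure([1, 2, 3, 4]))  # (11, 28, 17, 3)
```

The gap between the number of free probabilities and the number of scale values grows quickly with the size of the universe, which underlines how restrictive the Luce model is.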

There is another interesting feature that follows from the Luce model, expressed in the next
Corollary.


Corollary 1 
If IIA holds it follows that for distinct i, j and ke S 


(3.8) P(A) PUD PD) =F EDP DP, (GA). 


The proof of this result is immediate.
Recall that IIA only implies rationality "in the long run", or at the aggregate level. Thus the
probability of intransitive sequences (chains) is positive. The result in Corollary 1 is a statement about
intransitive chains because the interpretation of (3.8) is that

$P(i \succ j \succ k \succ i) = P(i \succ k \succ j \succ i)$

where $\succ$ means "preferred to". In other words, the intransitive chains $i \succ j \succ k \succ i$ and $i \succ k \succ j \succ i$
have the same probability. This shows that although intransitive "chains" can occur with positive
probability there is no systematic violation of transitivity. In fact, it can also be proved that if (3.8)
holds then the binary choice probabilities must have the form


(3.9)   $P_i(\{i,j\}) = \frac{a(i)}{a(i) + a(j)}$


where $\{a(j), j \in S\}$ are unique up to multiplication by a constant, cf. Luce and Suppes (1965).


However, (3.8) does not imply IIA. Equation (3.8) is often called the Product rule.
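
As a quick illustration of the Product rule, the sketch below checks (3.8) in exact arithmetic for binary Luce probabilities of the form (3.9). The scale values are hypothetical and chosen only for the example.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical scale values a(j) for three alternatives.
a = {1: Fraction(1), 2: Fraction(2), 3: Fraction(5)}

def p_binary(i, j):
    """Binary Luce probability that i is chosen from {i, j}, as in (3.9)."""
    return a[i] / (a[i] + a[j])

def chain(i, j, k):
    """Probability attached to the intransitive chain i > j > k > i."""
    return p_binary(i, j) * p_binary(j, k) * p_binary(k, i)

# The Product rule: the two opposite chains have the same probability.
product_rule_holds = all(
    chain(i, j, k) == chain(i, k, j) for i, j, k in permutations(a, 3)
)
print(product_rule_holds, chain(1, 2, 3))
```

Both chains have positive probability, yet neither direction is favored, which is the "no systematic violation of transitivity" statement in the text.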


3.3 The relationship between IIA and the random utility formulation 

After Luce had introduced the IIA property and the corresponding Luce model, Luce (1959), the
question whether there exists a random utility model that is consistent with IIA was raised. A first
answer to this problem was given by Holman and Marley in an unpublished paper (cf. Luce and
Suppes, 1965, p. 338).


Theorem 3

Assume a random utility model, $U_j = \log a(j) + \varepsilon_j$, where $\varepsilon_j$, $j \in S$, are i.i.d. according to
the standard type III extreme value distribution


(3.10)   $P(\varepsilon_j \le x) = \exp(-e^{-x})$.


Then, for $j \in B \subset S$,


(3.11)   $P_j(B) = P\left(U_j = \max_{k \in B} U_k\right) = \frac{a(j)}{\sum_{k \in B} a(k)}$.


Thus, by Theorem 3 there exists a random utility model that rationalizes the Luce model.


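Proof: Let us first derive the cumulative distribution for $V_j = \max_{k \in B \setminus \{j\}} U_k$. We have


(3.12)   $P(V_j \le y) = \prod_{k \in B \setminus \{j\}} P(\varepsilon_k \le y - \log a(k)) = \prod_{k \in B \setminus \{j\}} \exp(-a(k) e^{-y}) = \exp(-e^{-y} D_j)$


where


(3.13)   $D_j = \sum_{k \in B \setminus \{j\}} a(k)$.


Hence


(3.14)   $P\left(U_j = \max_{k \in B} U_k\right) = P(U_j > V_j) = P(\varepsilon_j + \log a(j) > V_j) = \int_{-\infty}^{\infty} P(y > V_j)\, P(U_j \in (y, y+dy))$.


Note next that since by (3.10)

$P(U_j \le y) = P(\varepsilon_j + \log a(j) \le y) = \exp(-e^{-y} a(j))$

it follows that

$P(\varepsilon_j + \log a(j) \in (y, y+dy)) = \exp(-e^{-y} a(j))\, a(j)\, e^{-y}\, dy$.


Hence

(3.15)   $\int_{-\infty}^{\infty} P(y > V_j)\, P(U_j \in (y, y+dy)) = \int_{-\infty}^{\infty} \exp(-D_j e^{-y}) \exp(-a(j) e^{-y})\, a(j)\, e^{-y}\, dy$

         $= a(j) \int_{-\infty}^{\infty} \exp(-(D_j + a(j)) e^{-y})\, e^{-y}\, dy$

         $= \frac{a(j)}{D_j + a(j)} \Big[ \exp(-(D_j + a(j)) e^{-y}) \Big]_{-\infty}^{\infty} = \frac{a(j)}{D_j + a(j)}$.

Since

$D_j + a(j) = \sum_{k \in B} a(k)$

the result of the Theorem follows from (3.14) and (3.15).
Q.E.D.

Footnote 7: In the following the distribution function (3.10) will be called the standard extreme value distribution.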
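
Theorem 3 can also be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the scale values `a`, the number of draws and the seed are hypothetical, and the extreme value draws are generated by inverting the c.d.f. (3.10). It compares simulated choice frequencies with the Luce probabilities (3.11).

```python
import math
import random

def gumbel(rng):
    """Draw from the c.d.f. exp(-exp(-x)) in (3.10) by inversion."""
    u = rng.random()
    while u == 0.0:          # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_choice_shares(a, n_draws, seed=0):
    """Fraction of draws in which each alternative has the highest utility."""
    rng = random.Random(seed)
    counts = [0] * len(a)
    for _ in range(n_draws):
        utils = [math.log(aj) + gumbel(rng) for aj in a]
        counts[utils.index(max(utils))] += 1
    return [c / n_draws for c in counts]

a = [1.0, 2.0, 3.0]                       # hypothetical scale values a(j)
simulated = simulate_choice_shares(a, 100_000)
luce = [aj / sum(a) for aj in a]          # right-hand side of (3.11)
print(simulated, luce)
```

With this many draws the simulated shares should agree with the Luce probabilities to roughly two decimal places.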


An interesting question is whether or not there exist distribution functions other than (3.10)
which imply the Luce model. McFadden (1973) proved that under particular assumptions the answer
is no. Later Yellott (1977) and Strauss (1979) gave proofs of this result under weaker conditions.


Yellott (1977) proved the following result.


Theorem 4

Assume that S contains more than two alternatives, and $U_j = \log a(j) + \varepsilon_j$, where $\varepsilon_j$, $j \in S$,
are i.i.d. with cumulative distribution function that is strictly increasing on the real line. Then (3.11)
holds if and only if $\varepsilon_j$ has the standard extreme value distribution function.


Example 3.1
Consider the choice between m brands of cornflakes. The price of brand j is $Z_j$. We assume
that the utility function of the consumer has the form

(3.16)   $U_j = Z_j \beta + \varepsilon_j \sigma$

where $\beta < 0$ and $\sigma > 0$ are unknown parameters, and $\varepsilon_j$, j=1,2,...,m, are i.i. extreme value distributed.
Without loss of generality we can write the utility function as

(3.17)   $\tilde{U}_j = Z_j \beta / \sigma + \varepsilon_j = Z_j \tilde{\beta} + \varepsilon_j$.


From Theorem 3 it follows that the choice probabilities can be written as


(3.18)   $P_j = \frac{\exp(Z_j \tilde{\beta})}{\sum_{k=1}^{m} \exp(Z_k \tilde{\beta})}$.


Clearly, $\tilde{\beta}$ is identified, since


$\log\left(\frac{P_j}{P_1}\right) = (Z_j - Z_1)\, \tilde{\beta}$.


However, $\sigma$ is not identified. Note that the variance of the error term in the utility function is large
when $\sigma$ is large, which in formulation (3.17) corresponds to a small $\tilde{\beta}$ in absolute value.


When $\tilde{\beta}$ has been estimated one can compute the aggregate own- and cross-price elasticities
according to the formulae


(3.19)   $\frac{\partial \log P_j}{\partial \log Z_j} = \tilde{\beta} Z_j (1 - P_j)$

and

(3.20)   $\frac{\partial \log P_j}{\partial \log Z_k} = -\tilde{\beta} Z_k P_k$

for $k \ne j$.
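
The elasticity formulae (3.19) and (3.20) can be checked numerically. The following sketch uses hypothetical prices and a hypothetical coefficient value (none of these numbers come from the text) and compares the closed-form elasticities with a finite-difference approximation of the logit probabilities (3.18).

```python
import math

def logit_probs(Z, beta):
    """Choice probabilities (3.18) for scalar prices Z_j and coefficient beta."""
    w = [math.exp(z * beta) for z in Z]
    total = sum(w)
    return [x / total for x in w]

def elasticity(Z, beta, j, k):
    """d log P_j / d log Z_k from (3.19) when j == k and (3.20) otherwise."""
    P = logit_probs(Z, beta)
    if j == k:
        return beta * Z[j] * (1.0 - P[j])
    return -beta * Z[k] * P[k]

def numeric_elasticity(Z, beta, j, k, h=1e-6):
    """Central finite difference of log P_j with respect to log Z_k."""
    up, down = list(Z), list(Z)
    up[k] *= math.exp(h)
    down[k] *= math.exp(-h)
    return (math.log(logit_probs(up, beta)[j])
            - math.log(logit_probs(down, beta)[j])) / (2.0 * h)

Z = [2.0, 2.5, 3.0]    # hypothetical prices of three brands
beta = -1.2            # hypothetical (negative) price coefficient

for j in range(len(Z)):
    print([round(elasticity(Z, beta, j, k), 4) for k in range(len(Z))])
```

Because the coefficient is negative, the own-price elasticities come out negative and the cross-price elasticities positive, as one would expect for substitutes.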
Example 3.2

Consider a transportation choice problem. There are two feasible alternatives, namely driving
own car (Alternative 1), or riding a bus (Alternative 2).
Let i index the commuter and let

$Z_{i1j}$ = 1 if j = 1, and 0 otherwise,
$Z_{i2j}$ = In-vehicle time, alternative j,
$Z_{i3j}$ = Out-of-vehicle time, alternative j,
$Z_{i4j}$ = Transportation cost, alternative j,
$Z_{i51}$ = 1 if the household owns a car, and 0 otherwise,

and

$Z_{i52} = 0$.


The utility function is assumed to have the structure

$U_{ij} = Z_{ij} \beta + \varepsilon_{ij}$

where $Z_{ij} = (Z_{i1j}, Z_{i2j}, Z_{i3j}, Z_{i4j}, Z_{i5j})$, $\varepsilon_{i1}$ and $\varepsilon_{i2}$ are i.i. extreme value distributed, and $\beta$ is a vector of
unknown coefficients. From these assumptions it follows that the probability that commuter i shall
choose alternative j is given by


(3.21)   $P_{ij} = \frac{\exp(Z_{ij} \beta)}{\sum_{k=1}^{2} \exp(Z_{ik} \beta)}$.


From a sample of observations of individual choices and attribute variables one can estimate $\beta$ by the
maximum likelihood procedure.
Let us consider how the model above can be applied in policy simulations once $\beta$ has been
estimated. Consider a group of individuals facing some attribute vector $Z_j$, j = 1,2. The corresponding
choice probability equals


(3.22)   $P_j = \frac{\exp(Z_j \beta)}{\sum_{k=1}^{2} \exp(Z_k \beta)}$

for j = 1,2. From (3.22) it follows that, for attribute component r,

(3.23)   $\frac{\partial \log P_j}{\partial \log Z_{rj}} = \beta_r Z_{rj} (1 - P_j)$

and

(3.24)   $\frac{\partial \log P_j}{\partial \log Z_{rk}} = -\beta_r Z_{rk} P_k$


for $k \ne j$. Eq. (3.23) expresses the "own elasticities" while (3.24) expresses the "cross elasticities".
Specifically, (3.23) yields the relative (percentage) change in the fraction of individuals that choose
alternative j that follows from a one percent increase in $Z_{rj}$.


3.4. The independent random utility model 


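If $\varepsilon_j$, $j \in S$, are independent then the choice probabilities can be expressed as

(3.25)   $P_j(B) = \int_{-\infty}^{\infty} \prod_{k \in B \setminus \{j\}} F_k(y - v_k)\, F_j'(y - v_j)\, dy$

where $F_k(y) = P(\varepsilon_k \le y)$, and $B \subset S$.


To realize that (3.25) holds note that since $\varepsilon_j$, $j \in S$, are independent we get


$P\left(\max_{k \in B \setminus \{j\}} U_k \le y\right) = P\left(\bigcap_{k \in B \setminus \{j\}} (\varepsilon_k \le y - v_k)\right) = \prod_{k \in B \setminus \{j\}} P(\varepsilon_k \le y - v_k) = \prod_{k \in B \setminus \{j\}} F_k(y - v_k)$.


Furthermore,


$P(U_j \in (y, y+dy)) = F_j'(y - v_j)\, dy$.


Hence,


$P_j(B) = P\left(U_j > \max_{k \in B \setminus \{j\}} U_k\right) = \int_{-\infty}^{\infty} P\left(y > \max_{k \in B \setminus \{j\}} U_k\right) P(U_j \in (y, y+dy)) = \int_{-\infty}^{\infty} \prod_{k \in B \setminus \{j\}} F_k(y - v_k)\, F_j'(y - v_j)\, dy$.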


Example 3.3. (Multinomial logit) 


Assume that


(3.26)   $F_j(y) = \exp(-e^{-y})$.

Then (3.25) yields

(3.27)   $P_j(B) = \frac{e^{v_j}}{\sum_{k \in B} e^{v_k}}$.

Example 3.4. (Independent multinomial probit)

If

(3.28)   $F_j(y) = \Phi(y) \equiv \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-x^2/2}\, dx$

then

(3.29)   $P_j(B) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} \prod_{k \in B \setminus \{j\}} \Phi(y - v_k) \exp\left(-\tfrac{1}{2}(y - v_j)^2\right) dy$.


It has been found through simulations and empirical applications that the independent probit model
yields choice probabilities that are close to the multinomial logit choice probabilities.


Example 3.5. (Binary probit)
Assume that B = {1,2} and $F_j(y) = \Phi(y\sqrt{2})$. Then


(3.30)   $P(U_1 > U_2) = \Phi(v_1 - v_2)$.
Example 3.6. (Binary Arcus-tangens)
Assume that B = {1,2} and


(3.31)   $F_j'(y) = \frac{2}{\pi (1 + 4y^2)}$.


The density (3.31) is the density of a Cauchy distribution (with scale parameter 1/2). Then


(3.32)   $P(U_1 > U_2) = \frac{1}{2} + \frac{1}{\pi} \mathrm{Arctg}(v_1 - v_2)$.


The Arcus-tangens model differs essentially from the binary logit and probit models in that the tails of
the Arcus-tangens model are much heavier than for the other two models.
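
To make the remark about tails concrete, the sketch below evaluates the three binary models as functions of the utility difference d = v_1 - v_2: the binary logit implied by (3.27), the binary probit (3.30) and the Arcus-tangens model (3.32). The function names and the grid of differences are illustrative choices.

```python
import math

def binary_logit(d):
    """P(U_1 > U_2) from (3.27) with B = {1, 2}."""
    return 1.0 / (1.0 + math.exp(-d))

def binary_probit(d):
    """P(U_1 > U_2) from (3.30), using the error function for the normal c.d.f."""
    return 0.5 * (1.0 + math.erf(d / math.sqrt(2.0)))

def binary_arctan(d):
    """P(U_1 > U_2) from (3.32)."""
    return 0.5 + math.atan(d) / math.pi

for d in (0.0, 1.0, 3.0, 6.0):
    print(d, round(binary_logit(d), 4), round(binary_probit(d), 4), round(binary_arctan(d), 4))
```

Even at a large utility difference the Arcus-tangens model leaves a non-negligible probability for the inferior alternative, whereas the probit probability is essentially zero there.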


3.5. Specification of the structural terms, examples 


Let $Z_j = (Z_{j1}, Z_{j2}, \ldots, Z_{jK})$ denote a vector of attributes that characterize alternative j. In the absence
of individual characteristics, a convenient functional form is


(3.33)   $v_j = Z_j \beta = \sum_{k=1}^{K} Z_{jk} \beta_k$.

A more general specification, which was already mentioned in chapter 2, is

(3.34)   $v_j = \sum_{k=1}^{K} R_k(Z_j, X)\, \beta_k$

where $R_k(Z_j, X)$, k = 1,...,K, are known functions of the attribute vector and a variable vector X
that characterizes the agent.


Example 3.7
If $X = (X_1, X_2)$ and $Z_j = (Z_{j1}, Z_{j2})$, a type of specification that is often used is

(3.35)   $v_j = Z_{j1} \beta_1 + Z_{j2} \beta_2 + Z_{j1} X_1 \beta_3 + Z_{j1} X_2 \beta_4 + Z_{j2} X_1 \beta_5 + Z_{j2} X_2 \beta_6$.

In some applications the assumption of linear-in-parameter functional form may, however, be too
restrictive.


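Example 3.8. (Box-Cox transformation):
Let $Z_j = (Z_{j1}, Z_{j2})$, $Z_{jk} > 0$, k = 1,2,
and

(3.36)   $v_j = \beta_1 \frac{Z_{j1}^{\alpha_1} - 1}{\alpha_1} + \beta_2 \frac{Z_{j2}^{\alpha_2} - 1}{\alpha_2}$

where $\alpha_1$, $\alpha_2$, $\beta_1$, $\beta_2$ are unknown parameters. The transformation

(3.37)   $\frac{y^{\alpha} - 1}{\alpha}$,

y > 0, is called a Box-Cox transformation of y and it contains the linear function as a special case
($\alpha = 1$). When $\alpha \to 0$ then

$\frac{y^{\alpha} - 1}{\alpha} \to \log y$.

When $\alpha < 1$, $(y^{\alpha} - 1)/\alpha$ is concave while it is convex when $\alpha > 1$. For any $\alpha$, $(y^{\alpha} - 1)/\alpha$ is
increasing in y.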
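
The properties of the Box-Cox transformation listed above are easy to verify numerically. The following sketch is an illustration only; the function names and the parameter values are hypothetical, and the case alpha = 0 is handled by the logarithmic limit.

```python
import math

def box_cox(y, alpha):
    """Box-Cox transformation (3.37); returns log(y) in the limit alpha = 0."""
    if y <= 0:
        raise ValueError("y must be positive")
    if alpha == 0:
        return math.log(y)
    # expm1 keeps precision when alpha * log(y) is close to zero
    return math.expm1(alpha * math.log(y)) / alpha

def structural_term(z1, z2, alpha1, alpha2, beta1, beta2):
    """Structural term v_j of (3.36) for attributes z1, z2 > 0."""
    return beta1 * box_cox(z1, alpha1) + beta2 * box_cox(z2, alpha2)

print(box_cox(3.0, 1.0))                  # linear case: y - 1
print(box_cox(2.0, 1e-8), math.log(2.0))  # close to the logarithm
print(structural_term(2.0, 5.0, 0.5, 0.0, -1.0, 0.3))
```

Estimating the exponents together with the coefficients makes the model non-linear in parameters, which is the price of the extra flexibility.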
A problem which is usually overlooked in discrete choice analyses is the fact that
simultaneous equation problems can arise as a result of unobservable attributes. Consider again
Example 3.7, and suppose that $Z_{j2}$ and $X_2$ are not observed. Suppose that we try to deal with the
missing variable problem by applying $Z_{j1} \mu_1$ as a proxy for $Z_{j1} X_2 \beta_4$, $X_1 \mu_{j2}$ as a proxy for $Z_{j2} X_1 \beta_5$ and $\mu_{j3}$
as a "proxy" for $Z_{j2} X_2 \beta_6 + Z_{j2} \beta_2$, where $\mu_1$, $\mu_{j2}$ and $\mu_{j3}$ are unknown parameters. This corresponds
to a utility function with error term

(3.38)   $\varepsilon_j^* = \varepsilon_j + Z_{j1} X_2 \beta_4 - Z_{j1} \mu_1 + Z_{j2} X_1 \beta_5 - X_1 \mu_{j2} + Z_{j2} X_2 \beta_6 + Z_{j2} \beta_2 - \mu_{j3}$.


Now if $X_1$ and $X_2$ are correlated we realize that $\varepsilon_j^*$ will be correlated with the deterministic term


(3.39)   $v_j^* = Z_{j1} (\beta_1 + \mu_1) + Z_{j1} X_1 \beta_3 + X_1 \mu_{j2} + \mu_{j3}$.


This simple example shows that simultaneous equation bias may be a serious problem in many cases
where data contains limited information about population heterogeneity. Note that even if we were
able to observe the relevant explanatory variables, we may still face the risk of getting simultaneous
equation bias as a result of a misspecified functional form of the deterministic term of the utility
function. This is easily demonstrated by a similar argument as the one above.


3.6. Aggregation of latent alternatives 
In this section we shall obtain a characterization of the choice model that may be justified in
applications where each observed alternative is an aggregate of latent (unobserved) elemental
alternatives. For the sake of expository convenience we proceed by means of a concrete example.

Consider migration choice: The agent faces a set B of feasible regions. Within region j there
is a set $B_j$ of feasible schooling and/or employment opportunities. The agent's problem is to choose his
favorite opportunity. The researcher only observes the choice of region but not the choice within the
chosen region. The agent is assumed to have the utility function with structure

(3.40)   $U_{jr} = v_j + \varepsilon_{jr}$


where j = 1,2,...,m, indexes the regions and $r \in B_j$ indexes the opportunities within $B_j$. The term $v_j$ is
deterministic and represents the systematic mean utility across all opportunities within $B_j$, while $\varepsilon_{jr}$,
$r \in B_j$, j = 1,2,...,m, are i.i.d. with cumulative distribution function F. Let $n_j$ be the number of
opportunities in $B_j$. Evidently the (indirect) utility of choosing region j equals

$U_j = \max_{r \in B_j} U_{jr} = v_j + \tilde{\varepsilon}_j$

where

$\tilde{\varepsilon}_j = \max_{r \in B_j} \varepsilon_{jr} = \max_{r \le n_j} \varepsilon_{jr}$.

Suppose next that F satisfies Condition (A.6) in Appendix A. Then Theorem A3 implies, provided $n_j$
is large, that


$P\left(\max_{r \le n_j} \varepsilon_{jr} - \log(c\, n_j) \le x\right) \approx \exp(-e^{-x})$


which means that

(3.41)   $v_j + \tilde{\varepsilon}_j = v_j + \log n_j + \log c + \varepsilon_j$

where $\varepsilon_j$, j = 1,2,...,m, are standard type III extreme value distributed. Thus we obtain from Theorem
3 that the probability of moving to region j equals


$P_j = P\left(U_j = \max_{k \in B} U_k\right) = \frac{\exp(v_j + \log c + \log n_j)}{\sum_{k \in B} \exp(v_k + \log c + \log n_k)} = \frac{c\, n_j e^{v_j}}{\sum_{k \in B} c\, n_k e^{v_k}} = \frac{n_j e^{v_j}}{\sum_{k \in B} n_k e^{v_k}}$.


If variables that characterize the regions are available these can be utilized to model $\{n_j\}$ and $\{v_j\}$.


The crucial point in the development above is that even if we are only interested in the
analysis of the choice of region, we can exploit the (theoretical) structure of the choice problem to
obtain a characterization of the choice model. Specifically, we have demonstrated that aggregation of
a large number of latent alternatives in fact yields IIA. Moreover, the sets of latent alternatives $\{B_j\}$ are
represented in the model by the respective sizes $\{n_j\}$.
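
The aggregation result can be illustrated by simulation. The sketch below makes a simplifying assumption that is not in the text: the latent error terms are themselves standard extreme value distributed, in which case the maximum over n_j draws is exactly extreme value distributed with location log n_j (so c = 1). The region utilities, the region sizes, the number of draws and the seed are hypothetical.

```python
import math
import random

def gumbel(rng):
    """Standard extreme value draw, c.d.f. exp(-exp(-x)), by inversion."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return -math.log(-math.log(u))

def simulate_region_shares(v, n, n_draws=40_000, seed=1):
    """Share of agents whose best latent opportunity lies in each region."""
    rng = random.Random(seed)
    counts = [0] * len(v)
    for _ in range(n_draws):
        best_region, best_u = None, -math.inf
        for j, (vj, nj) in enumerate(zip(v, n)):
            u = vj + max(gumbel(rng) for _ in range(nj))   # best opportunity in region j
            if u > best_u:
                best_region, best_u = j, u
        counts[best_region] += 1
    return [c / n_draws for c in counts]

v = [0.5, 0.0, -0.5]   # hypothetical mean utilities of three regions
n = [2, 5, 10]         # hypothetical numbers of latent opportunities

simulated = simulate_region_shares(v, n)
weights = [nj * math.exp(vj) for vj, nj in zip(v, n)]
predicted = [w / sum(weights) for w in weights]   # n_j e^{v_j} / sum_k n_k e^{v_k}
print(simulated, predicted)
```

The region with the lowest mean utility still attracts the largest share here because it offers the most opportunities, which is exactly the role played by the sizes n_j in the formula.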


3.7. Stochastic models for ranking 
So far we have only discussed models in which the interest is the agent's (most) preferred alternative.
However, in several cases it is of interest to specify the joint probability of the rank ordering of
alternatives that belong to S or to some subset of S. For example, in stated preference surveys, where
the agents are presented with hypothetical choice experiments, one has the possibility of designing the
questionnaires so as to elicit information about the agents' rank ordering. This yields more information
about preferences than data on solely the highest ranked alternatives, and it is therefore very useful for
empirical analysis. This type of modeling approach has been applied to analyze the potential demand
for products that may be introduced in the market.

The systematic development of stochastic models for ranking started with Luce (1959) and
Block and Marschak (1960). Specifically, they provided a powerful theoretical rationale for the
structure of the so-called ordered Luce model. The theoretical assumptions that underlie the ordered
Luce model can briefly be described as follows.


Let $\rho_B = (\rho_1, \rho_2, \ldots, \rho_m)$ be the rank ordering of the alternatives in B, where m is the number
of alternatives in B, and $B \subset S$. This means that $\rho_i$ denotes the element in B that has the i'th rank.
Moreover, let $P(\rho_B)$ denote the probability that the agent shall prefer rank ordering $\rho_B$ of B, and,
consistent with the notation above, let $P_i(B)$ be the probability that the agent shall rank alternative i
on top when B is the set of feasible alternatives. Recall that the empirical counterpart of these
probabilities is the ratio of the number of times the agent chooses a particular rank ordering to the
total number of times the experiment is replicated, or alternatively, the fraction of (observationally
identical) agents that choose a particular rank ordering.


Definition 4
The ranking probabilities constitute a random utility model if and only if


$P(\rho_B) = P\left(U(\rho_1) > U(\rho_2) > \ldots > U(\rho_m)\right)$


for $B \subset S$, where $\{U(j), j \in S\}$, are random variables.


Definition 5: Generalized IIA
The ranking probabilities satisfy the Independence from Irrelevant Alternatives (IIA) property
if and only if for any $B \subset S$


(3.42)   $P(\rho_B) = P_{\rho_1}(B)\, P_{\rho_2}(B \setminus \{\rho_1\}) \cdots P_{\rho_{m-1}}(\{\rho_{m-1}, \rho_m\})$.


Definition 5 states that an agent’s ranking behavior can (on average) be viewed as a multistage 
process in which he first selects the most preferred alternative, next he selects the second best among 
the remaining alternatives, etc. The crucial point here is that in each stage, the agent’s ranking of the 
remaining alternatives is independent of the alternatives that were selected in earlier steps. In other 
words, they are viewed as "irrelevant". 


We realize that Definition 2 is a special case of Definition 5. 


Theorem 5
Assume that the ranking probabilities are consistent with a random utility model and that IIA
holds. Then there exist positive scalars, a(j), $j \in S$, such that the ranking probabilities are given by
the model,

(3.43)   $P(\rho_B) = \prod_{i=1}^{m} \frac{a(\rho_i)}{\sum_{k \in B \setminus \{\rho_0, \rho_1, \ldots, \rho_{i-1}\}} a(k)}$

for $B \subset S$, where $\rho_0 = \emptyset$. The scalars, {a(j)}, are uniquely determined up to multiplication by a
positive constant.


Block and Marschak (1960, p. 109) have proved Theorem 5, the first part of which is a
generalization of a result in Luce (1959, p. 72), cf. Luce and Suppes (1965). As an example consider
the case when B = {1,2,3} and $\rho_B = (2,3,1)$. Then (3.43) reduces to


$P(\rho_B) = \frac{a(2)}{a(1) + a(2) + a(3)} \cdot \frac{a(3)}{a(1) + a(3)}$.


The next result shows that (3.43) is consistent with a simple random utility representation.


Theorem 6

Assume a random utility model with $U(j) = \log a(j) + \varepsilon_j$, where $\varepsilon_j$, $j \in S$, are i.i.d. with
standard extreme value distribution function. Then


(3.44)   $P(\rho_B) = P\left(U(\rho_1) > U(\rho_2) > \ldots > U(\rho_m)\right) = \prod_{i=1}^{m} \frac{a(\rho_i)}{\sum_{k \in B \setminus \{\rho_0, \rho_1, \ldots, \rho_{i-1}\}} a(k)}$.


Also here we realize that Theorem 1 is a special case of Theorem 6 because the choice
probability $P_j(B)$ is equal to the sum of all ranking probabilities with $\rho_1 = j$. A proof of Theorem 6 is
given in Strauss (1979).
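
The ordered Luce model is straightforward to compute. The sketch below is a minimal illustration with hypothetical scale values: it evaluates the ranking probability stage by stage, reproduces the three-alternative example above, and checks that summing the ranking probabilities over all orderings with a given alternative on top gives back the Luce choice probability.

```python
from fractions import Fraction
from itertools import permutations

def ranking_prob(rho, a):
    """Ordered Luce probability of the rank ordering rho (best alternative first)."""
    remaining = list(rho)
    p = Fraction(1)
    for alt in rho:
        p *= Fraction(a[alt], sum(a[k] for k in remaining))   # choose best of what is left
        remaining.remove(alt)
    return p

a = {1: 1, 2: 2, 3: 3}    # hypothetical scale values a(j)

example = ranking_prob((2, 3, 1), a)
total = sum(ranking_prob(rho, a) for rho in permutations(a))
top_choice = {
    j: sum(ranking_prob(rho, a) for rho in permutations(a) if rho[0] == j) for j in a
}
print(example, total, top_choice)
```

The last stage always contributes a factor of one, since only a single alternative remains.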


3.8. Stochastic dependent utilities across alternatives 

In the random utility models discussed above we only focused on models with random terms that are
independent across alternatives. In particular we noted that the independent extreme value random
utility model is equivalent to the Luce model. It has been found that the independent multinomial
probit model is "close" to the Luce model in the sense that the choice probabilities are close provided
the structural terms of the two models have the same structure. However, the assumption of
independent random terms is rather restrictive in some cases, which the following example will
demonstrate.

Example 3.9. (A version of the red-bus/blue-bus problem, Debreu, 1960)
Consider a commuter choice problem in which there are two transportation alternatives,
namely "car", (1), "bus", (2). The fraction of commuters that go by car and bus is 1/3 and 2/3,
respectively. If we assume that Luce's model holds we have


$P_1(\{1,2\}) = \frac{a_1}{a_1 + a_2} = \frac{1}{3}$.


With $a_1 = 1$ it follows that $a_2 = 2$. Suppose now that another bus service is introduced (alternative 3)
that is equal in all attributes to the existing bus service except that its buses have a different color
from the original buses. Thus, there are now red and blue buses which constitute two bus
transportation alternatives. Since the new bus alternative is essentially equivalent to the existing bus
service it must be true that the corresponding response strengths are equal, i.e., $a_3 = a_2 = 2$.
Consequently, since the choice set is now equal to {1,2,3} we have according to (3.7) that


$P_1(\{1,2,3\}) = \frac{a_1}{a_1 + a_2 + a_3} = \frac{1}{1 + 2 + 2} = \frac{1}{5}$


which implies that


$P_2(\{1,2,3\}) = P_3(\{1,2,3\}) = \frac{2}{5}$.


But intuitively, this seems unrealistic because it is plausible to assume that the commuters will tend to
treat the two bus alternatives as a single alternative so that


$P_1(\{1,2,3\}) = \frac{1}{3}$


and


$P_2(\{1,2,3\}) = P_3(\{1,2,3\}) = \frac{1}{3}$.


This example demonstrates that if alternatives are "similar" in some sense, then the Luce model is not
likely to be valid.
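
The arithmetic of the example can be reproduced in a few lines. The sketch below is only an illustration of the Luce formula; the alternative labels and the helper name `luce` are hypothetical.

```python
from fractions import Fraction

def luce(a, B):
    """Luce choice probabilities a(j) / sum of a(k) over the choice set B."""
    total = sum(a[k] for k in B)
    return {j: Fraction(a[j], total) for j in B}

a = {"car": 1, "red bus": 2}
before = luce(a, ["car", "red bus"])

a["blue bus"] = a["red bus"]          # identical service, different color
after = luce(a, ["car", "red bus", "blue bus"])

print(before)
print(after)
```

Adding a duplicate bus service lowers the car share from 1/3 to 1/5 under the Luce model, although nothing about the car or about bus travel has changed.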


Let us return to the general theory, and try to list some of the reasons why the random terms 
of the utility function may be correlated across alternatives. 


For expository simplicity consider the (true) utility specification


(3.45)   $U_j = Z_{j1} \beta_1 + X_1 Z_{j1} \beta_2 + X_2 Z_{j2} \beta_3 + \varepsilon_j$

and suppose that only $Z_{j1}$ and $X_1$ are observable. Thus, in practice we may be tempted to
resort to the misspecified version

(3.46)   $U_j = Z_{j1} \beta_1 + X_1 Z_{j1} \beta_2 + \mu_j + \varepsilon_j^*$

where $\mu_j$ has the interpretation

(3.47)   $\mu_j = \beta_3 Z_{j2}\, E X_2$,


(3.48)   $\varepsilon_j^* = \varepsilon_j + X_2 Z_{j2} \beta_3 - \beta_3 Z_{j2}\, E X_2$,


and where we now treat the unobservable components $X_2$ and $Z_{j2}$ as random variables. (In (3.47) and
(3.48) the mean is taken across the population.) Suppose that $\varepsilon_j$, j = 1,2,..., are independent. By
(3.48) we get

(3.49)   $\mathrm{cov}(\varepsilon_i^*, \varepsilon_j^*) = \beta_3^2 Z_{i2} Z_{j2}\, \mathrm{Var}\, X_2$.


Thus, we realize in this case that the error terms $\{\varepsilon_j^*\}$ are correlated.


If $X_2$ is observable but $\{Z_{j2}\}$ is not, we may in empirical estimation resort to the specification


(3.50)   $U_j = Z_{j1} \beta_1 + X_1 Z_{j1} \beta_2 + X_2 \mu_j + \varepsilon_j$

where

$\mu_j = \beta_3 Z_{j2}$.


In this case we therefore still have independent error terms provided we introduce alternative-specific
dummies in the deterministic terms of the utilities.


Finally suppose that $\{Z_{j2}\}$ are observable while $X_2$ is not. Then a natural specification would
be

(3.51)   $U_j = Z_{j1} \beta_1 + X_1 Z_{j1} \beta_2 + Z_{j2} \beta_3^* + \varepsilon_j^*$

where

(3.52)   $\varepsilon_j^* = \varepsilon_j + X_2 Z_{j2} \beta_3 - Z_{j2} \beta_3^*$

and

(3.53)   $\beta_3^* = \beta_3\, E X_2$.


Hence we get

(3.54)   $\mathrm{cov}(\varepsilon_i^*, \varepsilon_j^*) = \beta_3^2 Z_{i2} Z_{j2}\, \mathrm{Var}\, X_2$


which demonstrates that we may get interdependent random terms solely from unobserved population
heterogeneity.


3.9. The multinomial Probit model 

The best known multinomial random utility model with interdependent utilities is the multinomial
probit model. In this model the random terms in the utility function are assumed to be multinormally
distributed (with unknown covariance matrix). The concept of multinomial probit appeared already in
the writings of Thurstone (1927), but due to its computational complexity it has not been practically
useful for choice sets with more than five alternatives until quite recently. In recent years, however,
there have been a number of studies that apply simulation methods in the estimation procedure,
pioneered by McFadden (1989). Still the computational issue is far from being settled, since the
current simulation methods are complicated and costly to apply in practice. The following expression
for the multinomial choice probabilities is suggestive of the complexity of the problem. Let $h(x; \Omega)$
denote the density of an n-dimensional multinormal zero mean vector-variable with covariance matrix
$\Omega$. We have


(3.55)   $h(x; \Omega) = (2\pi)^{-n/2}\, |\Omega|^{-1/2} \exp\left(-\tfrac{1}{2} x' \Omega^{-1} x\right)$

where $|\Omega|$ denotes the determinant of $\Omega$. Furthermore


(3.56)   $P\left(v_j + \varepsilon_j = \max_{k \le n} (v_k + \varepsilon_k)\right) = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{v_j - v_1 + x_j} \cdots \int_{-\infty}^{v_j - v_n + x_j} h(x_1, \ldots, x_j, \ldots, x_n; \Omega) \prod_{k \ne j} dx_k \right] dx_j$.

From (3.56) we see that an n-dimensional integral must be evaluated to obtain the choice
probabilities. Moreover, the integration limits also depend on the unknown parameters in the utility
function. When the choice set contains more than five alternatives it is therefore necessary to use
simulation methods to evaluate these choice probabilities.
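
To give a feel for what "simulation methods" means here, the sketch below implements the crudest possible approach, a frequency simulator: draw correlated normal error terms, find the utility-maximizing alternative, and count. This is an assumption-laden illustration and not the simulator of McFadden (1989); the covariance matrices, utilities, number of draws and seed are hypothetical.

```python
import math
import random

def cholesky(omega):
    """Lower-triangular L with L L' = omega (omega must be positive definite)."""
    n = len(omega)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(omega[i][i] - s)
            else:
                L[i][j] = (omega[i][j] - s) / L[j][j]
    return L

def probit_shares(v, omega, n_draws=40_000, seed=0):
    """Frequency simulator for multinomial probit choice probabilities."""
    rng = random.Random(seed)
    L = cholesky(omega)
    n = len(v)
    counts = [0] * n
    for _ in range(n_draws):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # u = v + L z has covariance matrix omega
        u = [v[i] + sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
        counts[u.index(max(u))] += 1
    return [c / n_draws for c in counts]

v = [0.0, 0.0, 0.0]
independent = probit_shares(v, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
correlated = probit_shares(v, [[1, 0, 0], [0, 1, 0.99], [0, 0.99, 1]])
print(independent, correlated)
```

With strongly correlated error terms for alternatives 2 and 3, the share of alternative 1 moves from about one third towards one half, which an independent model cannot reproduce. A frequency simulator of this kind is not smooth in the parameters, which is one reason more refined simulators are used in estimation.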


3.10. The Generalized Extreme Value model 
McFadden (1978) and (1981) introduced the class of GEV models, which are random utility models that
contain the Luce model as a special case. He proved the following result:

Theorem 7

Let G be a non-negative function defined over $R_+^m$ that has the following properties:
(i) G is homogeneous of degree one,
(ii) $\lim_{x_i \to \infty} G(x_1, \ldots, x_i, \ldots, x_m) = \infty$, i = 1,2,...,m,
(iii) the k'th partial derivative of G with respect to any combination of k distinct components exists, is
continuous, non-negative if k is odd, and non-positive if k is even.


Then, the joint distribution function


(3.57)   $F(x) = \exp\left(-G(e^{-x_1}, e^{-x_2}, \ldots, e^{-x_m})\right)$


is a well defined multivariate (type III) extreme value distribution function. Moreover, if the random
terms of the utility function have joint distribution function given by (3.57), then it follows that


(3.58)   $P\left(v_j + \varepsilon_j = \max_{k \le m} (v_k + \varepsilon_k)\right) = \frac{e^{v_j}\, \partial_j G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})}{G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})}$


where $\partial_j$ denotes the partial derivative with respect to component j.


Above we have stated the choice probability for the case where all the choice alternatives in S
belong to the choice set. Obviously, we get the joint cumulative distribution function of the random
terms of the utilities that correspond to any choice set B by letting $x_i = \infty$, for all $i \notin B$. This
corresponds to letting $v_i = -\infty$, for all $i \notin B$ in the right hand side of (3.58).


To see that the Luce model emerges as a special case let


(3.59)   $G(x_1, \ldots, x_m) = \sum_{k=1}^{m} x_k$


from which it follows from (3.58) that


$P_j(B) = \frac{e^{v_j}}{\sum_{k \in B} e^{v_k}}$.
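
Formula (3.58) lends itself to a generic implementation: given any admissible generating function G, the choice probabilities follow from G and its first-order partial derivatives. The sketch below approximates the partial derivatives by central finite differences, which is a convenience chosen for this illustration rather than anything prescribed by the text; the function names and utilities are hypothetical.

```python
import math

def gev_choice_probs(G, v, rel_step=1e-6):
    """Choice probabilities e^{v_j} dG/dx_j / G evaluated at x = exp(v), as in (3.58)."""
    x = [math.exp(vk) for vk in v]
    g = G(x)
    probs = []
    for j, xj in enumerate(x):
        d = rel_step * xj
        up = x[:j] + [xj + d] + x[j + 1:]
        down = x[:j] + [xj - d] + x[j + 1:]
        dG = (G(up) - G(down)) / (2.0 * d)     # numerical partial derivative
        probs.append(xj * dG / g)
    return probs

def luce_G(x):
    """Generating function (3.59), which yields the Luce model."""
    return sum(x)

v = [0.3, -0.2, 1.0]                           # hypothetical structural terms
gev = gev_choice_probs(luce_G, v)
denom = sum(math.exp(vk) for vk in v)
logit = [math.exp(vk) / denom for vk in v]
print(gev, logit)
```

Because G is homogeneous of degree one, Euler's theorem guarantees that the probabilities produced this way sum to one for any admissible G.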
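Example 3.10
Let S = {1,2,3} and assume that

(3.60)   $G(x_1, x_2, x_3) = x_1 + \left(x_2^{1/\theta} + x_3^{1/\theta}\right)^{\theta}$

where $0 < \theta \le 1$. It can be demonstrated that $\theta$ has the interpretation


(3.61)   $\mathrm{corr}(\varepsilon_2, \varepsilon_3) = 1 - \theta^2$


and


$\mathrm{corr}(\varepsilon_1, \varepsilon_j) = 0$, j = 2,3.


From Theorem 7 we obtain that

(3.62)   $P_1(S) = \frac{e^{v_1}}{e^{v_1} + \left(e^{v_2/\theta} + e^{v_3/\theta}\right)^{\theta}}$


and


(3.63)   $P_j(S) = \frac{\left(e^{v_2/\theta} + e^{v_3/\theta}\right)^{\theta - 1} e^{v_j/\theta}}{e^{v_1} + \left(e^{v_2/\theta} + e^{v_3/\theta}\right)^{\theta}}$

for j = 2,3. If B = {1,2}, then

(3.64)   $P_1(\{1,2\}) = \frac{e^{v_1}}{e^{v_1} + e^{v_2}}$.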

When alternative 2 and alternative 3 are close substitutes $\theta$ should be close to zero. By applying
l'Hôpital's rule we obtain

$\lim_{\theta \to 0} \theta \log\left(e^{v_2/\theta} + e^{v_3/\theta}\right) = \max(v_2, v_3)$.


Consequently, when $\theta$ is close to zero the choice probabilities above are close to


(3.65)   $P_1(S) = \frac{e^{v_1}}{e^{v_1} + \exp(\max(v_2, v_3))}$

and

(3.66)   $P_j(S) = \frac{e^{v_j}\, I(v_j = \max(v_2, v_3))}{e^{v_1} + \exp(\max(v_2, v_3))}$


for j = 2,3, where I(A) is the indicator function that is equal to one if A is true and zero otherwise,
provided $v_2 \ne v_3$. For $v_2 = v_3$ we obtain

(3.67)   $P_1(S) = \frac{e^{v_1}}{e^{v_1} + e^{v_2}}$

and

(3.68)   $P_j(S) = \frac{e^{v_2}}{2\left(e^{v_1} + e^{v_2}\right)}$

for j = 2,3.


Consider the red-bus/blue-bus problem in Example 3.9, where $v_2 = v_3$, which by (3.64), (3.67)
and (3.68) yields


$P_1(\{1,2\}) = 1/3$, $P_1(\{1,2,3\}) = 1/3$


and


$P_2(\{1,2,3\}) = P_3(\{1,2,3\}) = 1/3$.


Thus the model generated from (3.60) with $\theta$ close to zero is able to capture the underlying structure
of the red-bus/blue-bus problem.
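
The effect of the similarity parameter can be traced numerically. The sketch below evaluates the choice probabilities (3.62) and (3.63) for the red-bus/blue-bus setting, with the utilities normalized so that the binary car share is 1/3; the function name is hypothetical, and the probabilities are computed through a log-sum-exp form to avoid overflow when the parameter is small.

```python
import math

def example_310_probs(v1, v2, v3, theta):
    """Choice probabilities (3.62)-(3.63) for G = x1 + (x2^(1/theta) + x3^(1/theta))^theta."""
    m = max(v2, v3)
    s2 = math.exp((v2 - m) / theta)
    s3 = math.exp((v3 - m) / theta)
    # theta * log(exp(v2/theta) + exp(v3/theta)), computed stably
    inclusive = m + theta * math.log(s2 + s3)
    denom = math.exp(v1) + math.exp(inclusive)
    p_car = math.exp(v1) / denom
    p_bus = math.exp(inclusive) / denom
    return p_car, p_bus * s2 / (s2 + s3), p_bus * s3 / (s2 + s3)

v_car, v_bus = 0.0, math.log(2.0)    # gives a car share of 1/3 against a single bus

for theta in (1.0, 0.5, 0.1, 0.01):
    print(theta, [round(p, 4) for p in example_310_probs(v_car, v_bus, v_bus, theta)])
```

With the parameter equal to one the Luce shares (1/5, 2/5, 2/5) appear, and as it approaches zero the shares approach (1/3, 1/3, 1/3).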


3.10.1. The Nested multinomial logit model (nested logit model) 

The nested logit model is an extension of the multinomial logit model which belongs to the GEV
class. The nested logit framework is appropriate in a modelling situation where the decision problem
has a tree-structure. This means that the choice set can be partitioned into subsets that group together
alternatives having several observable characteristics in common. It is assumed that the agent chooses
one of the subsets $A_j$ (say) in the first stage from which he selects the preferred alternative. The
red-bus/blue-bus problem has such a tree structure: Here the first stage concerns the choice between car
and bus while the second stage alternatives are "red-bus" and "blue-bus" in case the first stage choice
was bus.


Example 3.11
To illustrate further the typical choice situation, consider the choice of residential location.
Specifically, suppose the agent is considering a move to one out of two cities, which includes a
specific location within the preferred city. Let $U_{jk}$ denote the utility of location $k \in L_j$ within city j,
j = 1,2, where $L_j$ is the set of relevant locations within city j. Let $U_{jk} = v_{jk} + \varepsilon_{jk}$, where


(3.69)   $P\left(\bigcap_{k \in L_1} (\varepsilon_{1k} \le x_{1k}) \cap \bigcap_{k \in L_2} (\varepsilon_{2k} \le x_{2k})\right) = \exp\left(-G(e^{-x_{11}}, e^{-x_{12}}, \ldots, e^{-x_{21}}, e^{-x_{22}}, \ldots)\right)$

and

(3.70)   $G(y_{11}, y_{12}, \ldots, y_{21}, y_{22}, \ldots) = \sum_{j=1}^{2} \left(\sum_{k \in L_j} y_{jk}^{1/\theta_j}\right)^{\theta_j}$.


The structure (3.70) implies that


(3.71)   $\mathrm{corr}(\varepsilon_{jk}, \varepsilon_{jr}) = 1 - \theta_j^2$, for $r \ne k$,

and

(3.72)   $\mathrm{corr}(\varepsilon_{jk}, \varepsilon_{ir}) = 0$ for $j \ne i$.


The interpretation of the correlation structure is that the alternatives within $L_j$ are more "similar" than
alternatives where one belongs to $L_1$ and the other belongs to $L_2$.


Let $P_{jr}$ denote the joint probability of choosing location $r \in L_j$ and city j. Now from Theorem
7 we get that


(3.73)   $P_{jr} = P\left(U_{jr} = \max_{i=1,2} \left(\max_{k \in L_i} U_{ik}\right)\right) = \frac{e^{v_{jr}}\, \partial_{jr} G(e^{v_{11}}, e^{v_{12}}, \ldots)}{G(e^{v_{11}}, e^{v_{12}}, \ldots)} = \frac{\left(\sum_{k \in L_j} e^{v_{jk}/\theta_j}\right)^{\theta_j - 1} e^{v_{jr}/\theta_j}}{\sum_{i=1}^{2} \left(\sum_{k \in L_i} e^{v_{ik}/\theta_i}\right)^{\theta_i}}$

where $\partial_{jr} G$ is the partial derivative of G with respect to component (j,r). Note that we can rewrite
(3.73) as

(3.74)   $P_{jr} = \frac{\left(\sum_{k \in L_j} e^{v_{jk}/\theta_j}\right)^{\theta_j}}{\sum_{i=1}^{2} \left(\sum_{k \in L_i} e^{v_{ik}/\theta_i}\right)^{\theta_i}} \cdot \frac{e^{v_{jr}/\theta_j}}{\sum_{k \in L_j} e^{v_{jk}/\theta_j}} = P_j \cdot \frac{e^{v_{jr}/\theta_j}}{\sum_{k \in L_j} e^{v_{jk}/\theta_j}}$

where

(3.75)   $P_j = \sum_{k \in L_j} P_{jk}$.

The probability $P_j$ is the probability of choosing to move to city j (i.e. the optimal location lies within
city j). Furthermore


(3.76)   $\frac{P_{jr}}{P_j} = \frac{e^{v_{jr}/\theta_j}}{\sum_{k \in L_j} e^{v_{jk}/\theta_j}}$


is the probability of choosing location $r \in L_j$, given that city j has been selected. We notice that
$P_{jr}/P_j$ does not depend on alternatives outside $L_j$. Thus the probability $P_{jr}$ can be factored as a
product consisting of the probability of choosing city j times the probability of choosing r from $L_j$,
where the last probability has a structure as if $L_j$ were the choice set. We realize that it is therefore
consistent with the Luce model. However, only when $\theta_j = 1$ are the probabilities $P_{jr}$ and $P_j$ consistent
with the Luce model. Graphically, the above tree structure looks as follows:

[Figure: two-branch decision tree. Branches: "Location within city one" and "Location within city two".]


So far no theoretical motivation for the GEV model has been given, apart from the property 
that it contains the Luce model as a special case. I shall therefore conclude this section by reviewing 


two invariance properties that characterize the GEV class, and discuss their implications. 


Definition 6: The DIM property
The utilities $\{U_j\}$ satisfy DIM if and only if the distribution of $\max_k U_k$ is independent of
which variable attains the maximum. (DIM is an acronym for: Distribution is Invariant of which
variable attains the Maximum.)


Definition 7: The MSD property
The utilities $\{U_j\}$ satisfy MSD if and only if the distribution of $\max_k U_k$ is the same (apart
from a location shift) as the distribution of $U_1$. (MSD is an acronym for: the Maximum utility has the
Same Distribution as the distribution of $U_1 + b$.)


If the utilities satisfy DIM it means that the indirect utility is not correlated with the utility of
the chosen alternative.

This property corresponds to the notion that the indirect utility in the deterministic micro
theory has prices and income as arguments, but the chosen quantities do not enter as arguments, nor
do their corresponding direct utility.

The MSD property is natural, since it implies that the stochastic properties of the utilities are
invariant under aggregation of alternatives. To realize this suppose that the universe of alternatives is
divided into subsets of alternatives called "aggregate alternatives". Thus each aggregate alternative
consists of one or several "basic" alternatives. It is understood that the consumer's choice of an
aggregate alternative means that he chooses a basic alternative that belongs to the aggregate one.
Consequently, the utility of the aggregate alternative must be the maximum of the utilities of the basic
alternatives within the aggregate one. Under MSD, the utility of the aggregate alternative will
therefore have the same distribution (apart from a location shift) as the basic utilities.


Theorem 8

Assume that $U_j = v_j + \varepsilon_j$, where the c.d.f. F of $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m)$ does not depend on $\{v_j\}$.

(i) Then F satisfies DIM if and only if


(3.77)   $F(y_1, y_2, \ldots, y_m) = \psi\left(G(e^{-y_1}, e^{-y_2}, \ldots, e^{-y_m})\right)$


where G is a homogeneous function and $\psi$ is a positive function (subject to F being a proper
distribution function).
(ii) If $\varepsilon_1, \varepsilon_2, \ldots$ have a common cumulative distribution function then F satisfies MSD if and only if
(3.77) holds.


A proof of Theorem 8 is given by Robertson and Strauss (1981).


From (3.77) and Theorem 7 we realize that when $\psi(x) = \exp(-x)$ we obtain the GEV class.
Strauss (1979) has proved the following result which follows readily from Theorem 8, and
extends the result of Theorem 7. This result shows that the choice probabilities do not depend on $\psi$.


Corollary 2
If (3.77) holds then the choice probabilities are given by


$P\left(v_j + \varepsilon_j = \max_k (v_k + \varepsilon_k)\right) = \frac{e^{v_j}\, \partial_j G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})}{G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})}$.


Thus, from Theorem 7 we realize that the class of models determined by (3.77) is equivalent
to the GEV class.


Until recently it has not been clear which restrictions on the choice probabilities are implied
by the GEV class. Dagsvik (1995) proved that the GEV class is very large; in fact the GEV class
yields no other restrictions on the choice probabilities beyond those following from the random utility
assumption.


Theorem 9

Assume that $U_j = v_j + \varepsilon_j$, where the cumulative distribution function F of $\varepsilon$ does not depend
on $\{v_j\}$. If (3.77) holds then IIA holds if and only if


(3.78)   $F(y_1, y_2, \ldots, y_m) = \psi\left(\sum_{k=1}^{m} e^{-\alpha y_k}\right)$

where $\alpha > 0$ is an arbitrary constant.


A proof of Theorem 9 is given by Strauss (1979).


From (3.78) we realize that when $\psi(x) = \exp(-x)$ we obtain the independent extreme value
model.


Example 3.12

Another example is obtained when

(3.79)   $\psi(x) = \frac{1}{1 + x}$

in which case (3.78) yields

(3.80)   $F(y_1, y_2, \ldots, y_m) = \frac{1}{1 + \sum_{k=1}^{m} e^{-\alpha y_k}}$.

Example 3.13
Assume that

(3.81)   $\psi(x) = \exp\left(-x^{1/\alpha}\right)$

with $\alpha > 1$. Then (3.78) implies that

(3.82)   $F(y_1, y_2, \ldots, y_m) = \exp\left(-\left(\sum_{k=1}^{m} e^{-\alpha y_k}\right)^{1/\alpha}\right)$.


In this model it can be demonstrated that


(3.83)   $\mathrm{corr}(\varepsilon_i, \varepsilon_j) = 1 - \frac{1}{\alpha^2}$


which shows that the Luce model is consistent with a random utility model with any correlation
(different from zero and one) between the utilities as long as the correlation structure is symmetric.

4. More advanced examples of discrete choice analysis


4.1. Labor supply (I)
Consider the binary decision problem of wanting to work or not. Take the standard neo-classical
model as a point of departure. Let V(C,L) be the agent's utility in consumption, C, and annual leisure,
L. The budget constraint equals

(4.1)   $C = hW + I$

where W is the wage rate the agent faces in the market, h is annual hours of work and I is non-labor
income (for example the income provided by the spouse). The time constraint equals

(4.2)   $h + L \le M$ (= 8760).

According to this model utility maximization implies that the agent supplies labor if


(4.3)   $W > \frac{\partial_2 V(I, M)}{\partial_1 V(I, M)} \equiv W^*$


where $\partial_j$ denotes the partial derivative with respect to component j. If the inequality is reversed, then
the agent will not wish to work. $W^*$ is called the reservation wage. Suppose for example that the
utility function has the form


(4.4)   $V(C, L) = \beta_1 \frac{C^{\alpha_1} - 1}{\alpha_1} + \beta_2 M \frac{(L/M)^{\alpha_2} - 1}{\alpha_2}$

where $\alpha_1 < 1$, $\alpha_2 < 1$, $\beta_1 > 0$, $\beta_2 > 0$. Then V(C,L) is increasing and strictly concave in (C,L). The
reservation wage equals


(4.5)   $W^* = \frac{\partial_2 V(I, M)}{\partial_1 V(I, M)} = \frac{\beta_2}{\beta_1} I^{1 - \alpha_1}$.


After taking the logarithm on both sides of (4.3) and inserting (4.5) we get that the agent will supply
labor if


(4.6)   $\log W > (1 - \alpha_1) \log I + \log\left(\frac{\beta_2}{\beta_1}\right)$.


Suppose next that we wish to estimate the unknown parameters of this model from a sample of 
individuals of which some work and some do not work. Unfortunately, we cannot base the estimation 
procedure immediately on (4.6) because the wage rate is not observed for those individuals that do not 
work. For all individuals in the sample we observe, say, age, non-labor income, length of education 
and number of small children. We assume that the parameter $\beta_2/\beta_1$ depends on age and number of 
small children, $X_2$, such that 

(4.7)    $\log\left(\frac{\beta_2}{\beta_1}\right) = X_2 b + \varepsilon_2$,

where $\varepsilon_2$ is a random term which accounts for unobserved variables that affect the preferences and b is 
a parameter vector. To deal with the fact that the wage rate is only observed for those agents who 
work, we shall next introduce a wage equation. Specifically, we assume that 

(4.8)    $\log W = X_1 a + \varepsilon_1$,

where $X_1$ consists of length of education and age and a is the associated parameter vector. $\varepsilon_1$ is a 
random variable that accounts for unobserved factors that affect the wage rate, such as type of 
schooling, the effect of ability and family background, etc. For simplicity we assume that $\alpha_1$ is 
common to all agents. If $\varepsilon_1$ and $\varepsilon_2$ are independent and normally distributed with $E\varepsilon_j = 0$, 
$\mathrm{Var}\,\varepsilon_j = \sigma_j^2$, we get that the probability of working equals a probit model given by 


(4.9)    $P_2 = P(W > W^*) = \Phi\left(\frac{Xs - (1-\alpha_1)\log I}{\sqrt{\sigma_1^2+\sigma_2^2}}\right)$,

where $\Phi(\cdot)$ is the standard normal cumulative distribution function and s is a parameter vector such that 
$Xs = X_1 a - X_2 b$. From (4.9) we realize that only 

    $\frac{s_j}{\sqrt{\sigma_1^2+\sigma_2^2}}$  and  $\frac{1-\alpha_1}{\sqrt{\sigma_1^2+\sigma_2^2}}$

can be identified. 
If the purpose of this model is to analyze the effect from changes in level of education, family 
size and non-labor income on the probability of supplying labor then we do not need to identify the 
rest of the parameters. Let us write the model in a more convenient form; 

(4.10)    $P_2 = \Phi(Xs^* - c\log I)$,

where $c = (1-\alpha_1)/\sqrt{\sigma_1^2+\sigma_2^2}$ and $s_j^* = s_j/\sqrt{\sigma_1^2+\sigma_2^2}$. We have that 

(4.11)    $\frac{\partial \log P_2}{\partial \log I} = -\frac{c\,\Phi'(Xs^* - c\log I)}{\Phi(Xs^* - c\log I)} = -\frac{c\exp\left(-\frac{1}{2}(Xs^* - c\log I)^2\right)}{\Phi(Xs^* - c\log I)\sqrt{2\pi}}$.

Eq. (4.11) equals the elasticity of the probability of working with respect to non-labor income. 
Suppose that the random terms $\varepsilon_1$ and $\varepsilon_2$ are i.i. standard extreme value distributed. Then it 
follows that $P_2$ becomes a binary logit model given by 

(4.12)    $P_2 = \frac{\exp(E\log W)}{\exp(E\log W) + \exp(E\log W^*)} = \frac{1}{1 + \exp(-X\tilde{s} + \tilde{c}\log I)}$,

where $\tilde{s} = s^*\pi/\sqrt{3}$ and $\tilde{c} = c\,\pi/\sqrt{3}$. From (4.12) we now obtain the elasticity with respect to I as 

(4.13)    $\frac{\partial \log P_2}{\partial \log I} = -\tilde{c}(1 - P_2) = -\frac{\tilde{c}}{1 + \exp(X\tilde{s} - \tilde{c}\log I)}$.
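
The elasticity formulas (4.11) and (4.13) can be checked numerically. The sketch below evaluates the probit and logit participation probabilities and compares the analytic elasticities with central finite differences. All parameter values and function names are hypothetical and chosen only for illustration.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probit_prob(index, c, income):
    """Probability of working as in (4.10); index stands for X s*."""
    return norm_cdf(index - c * math.log(income))

def probit_elasticity(index, c, income):
    """Elasticity with respect to non-labor income as in (4.11)."""
    z = index - c * math.log(income)
    return -c * norm_pdf(z) / norm_cdf(z)

def logit_prob(index, c, income):
    """Probability of working as in (4.12); index stands for X s-tilde."""
    return 1.0 / (1.0 + math.exp(-index + c * math.log(income)))

def logit_elasticity(index, c, income):
    """Elasticity with respect to non-labor income as in (4.13)."""
    return -c * (1.0 - logit_prob(index, c, income))

def numeric_elasticity(prob, index, c, income, h=1e-6):
    """Central difference of log P with respect to log I."""
    up = math.log(prob(index, c, income * math.exp(h)))
    down = math.log(prob(index, c, income * math.exp(-h)))
    return (up - down) / (2.0 * h)

if __name__ == "__main__":
    index, c, income = 1.2, 0.4, 50.0   # hypothetical values
    print(probit_elasticity(index, c, income),
          numeric_elasticity(probit_prob, index, c, income))
    print(logit_elasticity(index, c, income),
          numeric_elasticity(logit_prob, index, c, income))
```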


4.2. Labor supply (II) 
Consider the choice of whether or not to work. The agent is assumed to face a set B of feasible jobs 
where job j has wage rate $W_j$. The set B is unobservable to the econometrician. The econometrician 
only observes if the agent works or not and the corresponding wage rate if the agent works. Let 

(4.14)    $U_j = \theta\log W_j + \varepsilon_j^*$,  $j \in B$,

be the utility of job j, where $\varepsilon_j^*$ is supposed to account for non-pecuniary aspects of job j, and 
$\theta > 0$ is a parameter. The utility of not working equals 

(4.15)    $U_0 = v_0 + \varepsilon_0$

where $v_0$ is a structural term and $\varepsilon_0$ is a random variable. In (4.14), $W_j$ is possibly correlated with $\varepsilon_j^*$ 
and we therefore introduce an instrument variable equation 

(4.16)    $\log W_j = X\beta + \eta_j$,

where X is a vector that consists of individual characteristics such as length of education and 
experience, and $\eta_j$ is a zero mean random term that may be correlated with $\varepsilon_j^*$. However, we assume 
that $\eta_j$ and $\varepsilon_k^*$ are independent when $k \ne j$. When (4.16) is inserted into (4.14) we get 

(4.17)    $U_j = \theta X\beta + \varepsilon_j$
where $\varepsilon_j = \varepsilon_j^* + \theta\eta_j$. Let n be the number of jobs in B. If we assume that $\varepsilon_j$, j = 0,1,2,...,n, are i.i. 
standard extreme value distributed then the probability of choosing job j equals 

(4.18)    $P_j = P\left(U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right) = \frac{e^{\theta X\beta}}{e^{v_0} + \sum_{k\in B} e^{\theta X\beta}} = \frac{e^{\theta X\beta}}{e^{v_0} + n e^{\theta X\beta}}$.

Hence the probability of working (which is the probability of choosing one of the jobs in B) equals 

(4.19)    $P_B = \sum_{j\in B} P_j = \frac{n e^{\theta X\beta}}{e^{v_0} + n e^{\theta X\beta}}$.

Suppose n depends on regional and/or group-specific unemployment rate, Z, in the following manner 

(4.20)    $\log n = \rho Z + \delta$

where $\rho$ and $\delta$ are unknown parameters. Then $P_B$ takes the form 

(4.21)    $P_B = \frac{1}{1 + \exp(v_0 - \delta - \rho Z - X\beta\theta)}$.


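Consider next the estimation of (4.16) from the subsample of working individuals. Since 
$\varepsilon_j = \varepsilon_j^* + \theta\eta_j$, it follows that the mean of $\eta_j$ is not necessarily equal to zero, given that j is the chosen 
alternative, i.e., 

    $E\left[\eta_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right] \ne 0$.

Define $\eta_j^*$ by 

(4.22)    $\eta_j = \alpha\varepsilon_j + \eta_j^*$

where $\alpha$ is an unknown parameter that is equal to 

(4.23)    $\alpha = \mathrm{cov}(\eta_j, \varepsilon_j)/\mathrm{Var}\,\varepsilon_j$.

This implies that $\varepsilon_j$ and $\eta_j^*$ are uncorrelated. Moreover, we have, by Lemma 1 in Appendix A that 

(4.24)    $E\left[\varepsilon_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right] = E\left[U_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right] - X\beta\theta$
                  $= E\max\left(U_0, \max_{k\in B} U_k\right) - X\beta\theta$.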
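Under the assumption of extreme value distributed utility terms we get 

(4.25)    $E\max\left(U_0, \max_{k\in B} U_k\right) = \log\left(\sum_{k\in B} e^{X\beta\theta} + e^{v_0}\right) + 0.5772 = \log\left(n e^{X\beta\theta} + e^{v_0}\right) + 0.5772$.

Hence, by combining (4.25) and (4.22) we get 

(4.26)    $E\left[\eta_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right] = \alpha E\left[\varepsilon_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right]$
                  $= \alpha\log\left(n e^{X\beta\theta} + e^{v_0}\right) - \alpha X\beta\theta + 0.5772\,\alpha$
                  $= -\alpha\log P_B + \alpha\log n + 0.5772\,\alpha = -\alpha\log P_B + \alpha\rho Z + \alpha\delta + 0.5772\,\alpha$,

where $P_B$ denotes the probability of working given by (4.19). 
Consequently, we can write the wage equation as 

(4.27)    $\log W_j = X\beta - \alpha\log P_B + \alpha\rho Z + \delta^* + \tilde{\eta}_j$

where $\delta^* = \alpha\delta + 0.5772\,\alpha$ and $\tilde{\eta}_j$ is a random term with the property that 

(4.28)    $E\left[\tilde{\eta}_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right] = 0$.

Thus we can estimate (4.27) consistently from the subsample of working individuals. 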


Consider finally the conditional variance 

    $\mathrm{Var}\left(\eta_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right)$.

From Lemma 1 in Appendix A we get 

(4.29)    $\mathrm{Var}\left(\varepsilon_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right) = \mathrm{Var}\left(U_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right)$
                  $= \mathrm{Var}\left(\max\left(U_0, \max_{k\in B} U_k\right)\right) = \mathrm{Var}\,\varepsilon_j$.

The last equality in (4.29) follows from the fact that $\max\left(U_0, \max_{k\in B} U_k\right)$ 
has the same distribution as $\varepsilon_j$, apart from an additive deterministic term. Consequently, since $\varepsilon_j$ and 
$\eta_j^*$ are independent, 

(4.30)    $\mathrm{Var}\left(\eta_j \mid U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right) = \mathrm{Var}\,\eta_j^* + \alpha^2\mathrm{Var}\,\varepsilon_j = \mathrm{Var}\,\eta_j$.

The last result shows that in contrast to the case with normally distributed disturbances (cf. Heckman, 
1979), the conditional variance of $\eta_j$ given that j is the chosen alternative equals the corresponding 
unconditional variance. 
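
The properties used in (4.24), (4.25) and (4.29) can be illustrated by simulation. The sketch below draws i.i. standard extreme value terms, forms the utilities of n jobs and of not working, and compares the simulated mean and variance of the maximum utility (and of the utility of a job given that it is chosen) with $\log(n e^{X\beta\theta} + e^{v_0}) + 0.5772$ and $\pi^2/6$. The numbers of jobs and draws, the parameter values and the function names are assumptions for this illustration.

```python
import math
import random

EULER = 0.5772156649  # Euler's constant, the mean of a standard extreme value variable

def gumbel(rng):
    """One standard extreme value (type I) draw by inversion."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return -math.log(-math.log(u))

def simulate(n_jobs, xb_theta, v0, draws, seed=0):
    """Return simulated max utilities and the utilities of job 1 when job 1 is chosen."""
    rng = random.Random(seed)
    max_utils, job1_when_chosen = [], []
    for _ in range(draws):
        u0 = v0 + gumbel(rng)
        jobs = [xb_theta + gumbel(rng) for _ in range(n_jobs)]
        best = max(u0, max(jobs))
        max_utils.append(best)
        if jobs[0] == best:
            job1_when_chosen.append(jobs[0])
    return max_utils, job1_when_chosen

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

if __name__ == "__main__":
    n_jobs, xb_theta, v0 = 5, 1.0, 0.5   # hypothetical values
    max_utils, chosen = simulate(n_jobs, xb_theta, v0, draws=100000)
    theory_mean = math.log(n_jobs * math.exp(xb_theta) + math.exp(v0)) + EULER
    print("E max U:", mean(max_utils), "theory:", theory_mean)
    print("Var max U:", variance(max_utils), "theory:", math.pi ** 2 / 6)
    print("E[U_1 | job 1 chosen]:", mean(chosen))
    print("Var[U_1 | job 1 chosen]:", variance(chosen))
```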


4.3. Labor supply (III) 
Consider an alternative modelling framework to the one discussed in section 4.2. We assume that the 
agent faces a set B (unobservable) of feasible job opportunities. Let 

(4.31)    $U_j = v(W_j) + \varepsilon_j$,

j=1,2,...,n, be the utility of job j with wage rate $W_j$, where $v(W_j)$ is the structural part of the utility 
function that is common to all agents, while $\varepsilon_j$ is an agent-specific random term that accounts for 
non-pecuniary aspects associated with job j. Similarly, let 

(4.32)    $U_0 = v_0 + \varepsilon_0$

be the utility of not working. Suppose furthermore that $\varepsilon_j$, j=0,1,..., are i.i. standard extreme value 
distributed. 
Let B(w) be the subset of B that consists of all feasible jobs with wage rate w, and let n(w) be 
the number of jobs in B(w), and let D be the set of all possible wages. The probability of choosing job 
j in B equals 

(4.33)    $P_j = P\left(U_j = \max\left(U_0, \max_{k\in B} U_k\right)\right) = \frac{e^{v(W_j)}}{e^{v_0} + \sum_{k\in B} e^{v(W_k)}}$
                  $= \frac{e^{v(W_j)}}{e^{v_0} + \sum_{y\in D}\sum_{k\in B(y)} e^{v(y)}} = \frac{e^{v(W_j)}}{e^{v_0} + \sum_{y\in D} n(y)e^{v(y)}}$.

Hence the probability of choosing a job with wage rate w equals 

(4.34)    $P(w) = \sum_{j\in B(w)} P_j = \frac{n(w)e^{v(w)}}{e^{v_0} + \sum_{y\in D} n(y)e^{v(y)}} = \frac{e^{V(w)}}{e^{v_0} + \sum_{y\in D} e^{V(y)}}$

where 

(4.35)    $V(y) = \log n(y) + v(y)$.

From (4.35) we realize that we cannot without further assumptions separate n(w) from v(w). 
To this end suppose that the agent also receives nonlabor income. For example, a married woman or 
man may receive income from the spouse. In this case 

(4.36)    $v(w) = v^*(w + I)$

where I denotes nonlabor income, and $v^*(\cdot)$ is a concave parametric function. 
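
The identification problem noted after (4.35) can be made concrete with a small sketch: two different combinations of n(w) and v(w) that yield the same V(w) produce identical choice probabilities (4.34). The wage levels, job counts and utilities below are hypothetical.

```python
import math

def wage_choice_probs(n, v, v0):
    """P(w) from (4.34); n and v map each wage level to n(w) and v(w)."""
    big_v = {w: math.log(n[w]) + v[w] for w in n}          # V(w) in (4.35)
    denom = math.exp(v0) + sum(math.exp(x) for x in big_v.values())
    return {w: math.exp(big_v[w]) / denom for w in n}

if __name__ == "__main__":
    v0 = 0.3
    n_a = {100: 4.0, 150: 2.0, 200: 1.0}
    v_a = {100: 0.5, 150: 0.9, 200: 1.2}
    # Double the number of jobs at wage 150 and lower v(150) by log 2: V(150) is unchanged.
    n_b = dict(n_a)
    v_b = dict(v_a)
    n_b[150] = 2.0 * n_a[150]
    v_b[150] = v_a[150] - math.log(2.0)
    print(wage_choice_probs(n_a, v_a, v0))
    print(wage_choice_probs(n_b, v_b, v0))
```

Both calls print the same probabilities, so data on chosen wages alone cannot distinguish the two specifications.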


4.4. Transportation 

Suppose that commuters have the choice between driving own car or taking a bus. One is interested in 
estimating a behavioral model to study, for example, how the introduction of a new subway line will 
affect the commuters’ transportation choices. Consider a particular commuter (agent) and let $U_j(x)$ be 
the agent's joint utility of commodity vector x and transportation alternative j, j=1,2. Assume that the 
utility function has the structure 

(4.37)    $U_j(x) = U_{1j} + U_2(x)$.

The budget constraint is given by 

(4.38)    $px = y - q_j$,  $x \ge 0$,

where p is a vector of commodity prices and $q_j$ is the per-unit-cost of transportation. By maximizing 
$U_j(x)$ with respect to x subject to (4.38) we obtain the conditional indirect utility, given j, as 

(4.39)    $V_j(p, y - q_j) = U_{1j} + V_2(p, y - q_j)$

where 

(4.40)    $V_2(p, y) = \max_{px = y,\; x \ge 0} U_2(x)$.


Assume that 

(4.41)    $U_{1j} = \beta\log T_j + \varepsilon_j$

where $T_j$ is the travelling time with alternative j, $\beta$ is an unknown parameter and $\{\varepsilon_j\}$ are random 
terms that account for the effect of unobserved variables, such as walking distances and comfort. We 
assume that $\varepsilon_1$ and $\varepsilon_2$ are i.i. standard extreme value distributed. Assume furthermore that 

(4.42)    $V_2(p, y - q_j) = V_3(p) + \theta\log(y - q_j)$

where $\theta > 0$ is an unknown parameter. The assumptions above yield 

(4.43)    $V_j = V_3(p) + \beta\log T_j + \theta\log(y - q_j) + \varepsilon_j$,

which implies, since $V_3(p)$ is common to both alternatives, that 

(4.44)    $P_j(\{1,2\}) = \frac{\exp\left(\beta\log T_j + \theta\log(y - q_j)\right)}{\sum_{k=1}^{2}\exp\left(\beta\log T_k + \theta\log(y - q_k)\right)}$

for j=1,2. After the unknown parameters $\beta$ and $\theta$ have been estimated one can predict the fraction of 
commuters that will choose the subway alternative (alternative 3) given that $T_3$ and $q_3$ have been 
specified. Here, it is essential that one believes that $T_j$ and $q_j$ are the main attributes of importance. 
We thus get that the probability of choosing alternative j from {1,2,3} equals 

(4.45)    $P_j(\{1,2,3\}) = \frac{\exp\left(\beta\log T_j + \theta\log(y - q_j)\right)}{\sum_{k=1}^{3}\exp\left(\beta\log T_k + \theta\log(y - q_k)\right)}$.
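
The prediction step described above can be sketched as follows: the same estimated parameters are used to evaluate the choice probabilities before and after a third alternative with specified travelling time and cost is added. The parameter values, travelling times, costs and income are hypothetical, and the function name is an assumption.

```python
import math

def mode_shares(times, costs, income, beta, theta):
    """Choice probabilities of the form (4.44)/(4.45) for the listed alternatives."""
    weights = [math.exp(beta * math.log(t) + theta * math.log(income - q))
               for t, q in zip(times, costs)]
    total = sum(weights)
    return [w / total for w in weights]

if __name__ == "__main__":
    beta, theta = -1.5, 2.0     # hypothetical estimates (beta < 0: time is a cost)
    income = 100.0
    # Alternatives: car (1) and bus (2), then subway (3) is added.
    before = mode_shares([30.0, 45.0], [8.0, 3.0], income, beta, theta)
    after = mode_shares([30.0, 45.0, 25.0], [8.0, 3.0, 4.0], income, beta, theta)
    print("before:", before)
    print("after: ", after)
    # Under the logit model the car/bus odds are unaffected by the new alternative (IIA).
    print(before[0] / before[1], after[0] / after[1])
```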


4.5. Firms' location of plants (I) 

In this example we outline a framework for analyzing firms’ location of plants. Specifically, we 
assume that the firms face the choice of establishing a plant in one of several different sites (counties). 
Suppose furthermore that firms' profit functions (or expected profit functions) depend on observable 
characteristics that are common for all sites within particular regions. Let $C_r$ denote the set of counties 
within region r, r=1,2,...,m, and let $n_r$ be the number of counties in $C_r$. The regional attributes of 
interest may be population density and macro indicators that describe the industry structure. Finally, 
certain tax rates may differ across regions (tax shelters). Consider an arbitrarily selected firm. Let 
$U_{rj} = v_r + \varepsilon_{rj}$ denote the firm's utility of establishing a plant in county $j \in C_r$, where $\{\varepsilon_{rj}\}$ are i.i. 
standard extreme value distributed terms that account for unobserved region and county-specific 
attributes and $\{v_r\}$ are structural terms that depend on the attributes specific to region r. Let $P_{rj}$ be the 
probability of a location in county j in region r. We get 

(4.46)    $P_{rj} = P\left(U_{rj} = \max_{i}\max_{k\in C_i} U_{ik}\right) = \frac{e^{v_r}}{\sum_{i=1}^{m}\sum_{k\in C_i} e^{v_i}} = \frac{e^{v_r}}{\sum_{i=1}^{m} n_i e^{v_i}}$.

Hence, we get that the probability of a location within region r equals 

(4.47)    $P_r = \sum_{j\in C_r} P_{rj} = \frac{n_r e^{v_r}}{\sum_{i=1}^{m} n_i e^{v_i}} = \frac{e^{\tilde{v}_r}}{\sum_{i=1}^{m} e^{\tilde{v}_i}}$

where 

(4.48)    $\tilde{v}_r = v_r + \log n_r$.

If we assume that $v_r = Z_r\beta$, where $Z_r$ is the vector of observable attributes associated with region r, 
we get 

(4.49)    $\tilde{v}_r = Z_r\beta + \log n_r$.


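4.6. Firms' location of plants (II) 

We now consider an extension of the setting in Section 4.5. Suppose now that the error terms for 
counties within a common region are correlated. This may be a plausible assumption since it is often 
the case that counties within regions are more homogeneous than counties across regions. We shall 
now apply the nested logit framework to model this case. Let 

(4.50)    $G(y) = \sum_{r=1}^{m}\left(\sum_{j=1}^{n_r} y_{rj}^{1/\theta}\right)^{\theta}$

and let 

    $F(x) = \exp\left(-G\left(e^{-x_{11}}, e^{-x_{12}}, \ldots, e^{-x_{m n_m}}\right)\right)$

be the joint distribution function of $(\varepsilon_{11}, \varepsilon_{12}, \ldots, \varepsilon_{m n_m})$. Then it follows that 

(4.51)    $\mathrm{corr}(\varepsilon_{ri}, \varepsilon_{rj}) = 1 - \theta^2$

for $i \ne j$, $i,j \in C_r$, and 

(4.52)    $\mathrm{corr}(\varepsilon_{ri}, \varepsilon_{sj}) = 0$

for $i \in C_r$, $j \in C_s$, $r \ne s$, where $0 < \theta \le 1$. From Theorem 7 we get 

(4.53)    $P_{rj} = \frac{e^{v_r/\theta}\left(\sum_{k\in C_r} e^{v_r/\theta}\right)^{\theta-1}}{\sum_{i=1}^{m}\left(\sum_{k\in C_i} e^{v_i/\theta}\right)^{\theta}} = \frac{e^{v_r} n_r^{\theta-1}}{\sum_{i=1}^{m} e^{v_i} n_i^{\theta}}$,

since $\left(\sum_{k\in C_i} e^{v_i/\theta}\right)^{\theta} = n_i^{\theta} e^{v_i}$. This yields 

(4.54)    $P_r = \sum_{j\in C_r} P_{rj} = \frac{e^{v_r} n_r^{\theta}}{\sum_{i=1}^{m} e^{v_i} n_i^{\theta}} = \frac{e^{\tilde{v}_r}}{\sum_{i=1}^{m} e^{\tilde{v}_i}}$

where 

(4.55)    $\tilde{v}_r = v_r + \theta\log n_r = Z_r\beta + \theta\log n_r$.

Provided $n_1, n_2, \ldots$ are known we can estimate $\beta$ and $\theta$ from observations on plant locations with 
$\{Z_r, \log n_r\}$ as explanatory variables. 

From (4.54) we get 

(4.56)    $\frac{\partial\log P_r}{\partial\log n_r} = \theta(1 - P_r)$

and 

(4.57)    $\frac{\partial\log P_r}{\partial\log n_k} = -\theta P_k$

for $k \ne r$. Equations (4.56) and (4.57) express the effect on $P_r$ of increasing the number of counties in 
region r and in another region k, respectively. For 
example, one may wish to assess the effect of changing the number of counties that belong to a region 
with "tax shelters". 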
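
A minimal numerical sketch of this setting is given below. It assumes the standard nested logit generating function $G(y) = \sum_r (\sum_j y_{rj}^{1/\theta})^{\theta}$, for which the within-region correlation is $1 - \theta^2$, and structural terms that are equal for all counties within a region. Under these assumptions the region probability is proportional to $e^{v_r} n_r^{\theta}$, and the elasticities with respect to the number of counties can be checked by finite differences. All numerical values and function names are hypothetical.

```python
import math

def county_prob(r, v, n, theta):
    """Probability of one particular county in region r (nested logit, equal v within region)."""
    inner = [n_i * math.exp(v_i / theta) for v_i, n_i in zip(v, n)]  # sum over counties in region i
    denom = sum(s ** theta for s in inner)
    return math.exp(v[r] / theta) * inner[r] ** (theta - 1.0) / denom

def region_prob(r, v, n, theta):
    """Probability of locating somewhere in region r: proportional to exp(v_r) * n_r**theta."""
    weights = [math.exp(v_i) * n_i ** theta for v_i, n_i in zip(v, n)]
    return weights[r] / sum(weights)

def size_elasticity(r, k, v, n, theta, h=1e-6):
    """Central difference of log P_r with respect to log n_k (n treated as continuous)."""
    up = list(n)
    down = list(n)
    up[k] = n[k] * math.exp(h)
    down[k] = n[k] * math.exp(-h)
    return (math.log(region_prob(r, v, up, theta))
            - math.log(region_prob(r, v, down, theta))) / (2.0 * h)

if __name__ == "__main__":
    v = [0.2, -0.1, 0.4]      # hypothetical regional structural terms
    n = [5.0, 12.0, 8.0]      # hypothetical numbers of counties
    theta = 0.6               # within-region correlation 1 - 0.36 = 0.64
    p = [region_prob(r, v, n, theta) for r in range(3)]
    print("region probabilities:", p)
    print("own elasticity:", size_elasticity(0, 0, v, n, theta), theta * (1.0 - p[0]))
    print("cross elasticity:", size_elasticity(0, 1, v, n, theta), -theta * p[1])
```

With theta equal to one the expressions reduce to the independent case of Section 4.5.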


4.7. Firms' location of plants (III) 

The setting here is the same as the one in Section 4.6. Suppose now that $\{n_r\}$ are unobservable, but 
that we observe the number of locations in at least one county in each region, say in county number 
one. Let $M_{r1}$ be the observed number of locations in county one in $C_r$, and let $M_r$ be the total number 
of observed locations within region r. Finally, let $M = \sum_{r=1}^{m} M_r$. Then $M_{r1}/M$ is an estimate of $P_{r1}$ 
and $M_r/M$ is an estimate of $P_r$. Since by (4.53) 

    $P_{r1} = P_r \cdot \frac{1}{n_r}$

it follows that consistent estimates for $n_r$ are given by 

(4.58)    $\hat{n}_r = \frac{M_r}{M_{r1}}$,  r=1,2,...,m. 





4.8. Potential demand for alternative fuel vehicles 

This example is taken from Dagsvik et al. (1996). To assess the potential demand for alternative fuel 
vehicles such as "electric" (1), "liquid propane gas" (lpg) (2), and "hybrid" (3) vehicles, an ordered 
logit model was estimated on the basis of a "stated preference" survey. In this survey each respondent 
in a randomly selected sample was exposed to 15 experiments. In each experiment the respondent was 
asked to rank three hypothetical vehicles characterized by specified attributes, according to the 
respondent’s preferences. These attributes are: "Purchase price", "Top speed", "Driving range between 
refueling/recharging", and "Fuel consumption". The total sample size (after the non-respondent 
individuals are removed) consisted of 662 individuals. About one half of the sample (group A) 
received choice sets with the alternatives "electric", "lpg", and "gasoline" vehicles, while the other 
half (group B) received "hybrid", "lpg" and "gasoline" vehicles. In this study "hybrid" means a 
combination of electric and gasoline technology. The gasoline alternative is labeled alternative 4. 

The individuals’ utility function was specified as 

(4.59)    $U_j(t) = Z_j(t)\beta + \mu_j + \varepsilon_j(t)$

where $Z_j(t)$ is a vector consisting of the four attributes of vehicle j in experiment t, t=1,2,...,15, and 
$\mu_j$ and $\beta$ are unknown parameters. Without loss of generality, we set $\mu_4 = 0$. As mentioned above 
group A has choice set $C_A = \{1,2,4\}$, while group B has choice set $C_B = \{2,3,4\}$. Let $P_{ijt}(C)$ be the 
probability that an individual shall rank alternative i on top and j second best in experiment t, and let 
$Y_{ij}^h(t) = 1$ if individual h ranks i on top and j second best in experiment t, and zero otherwise. From 
section 3.4 it follows that if $\{\varepsilon_j(t)\}$ are assumed to be i.i. standard extreme value distributed then 

(4.60)    $P_{ijt}(C) = \frac{\exp(Z_i(t)\beta + \mu_i)}{\sum_{r\in C}\exp(Z_r(t)\beta + \mu_r)} \cdot \frac{\exp(Z_j(t)\beta + \mu_j)}{\sum_{r\in C\setminus\{i\}}\exp(Z_r(t)\beta + \mu_r)}$

where C is equal to $C_A$ or $C_B$. We also assume that the random terms $\{\varepsilon_j(t)\}$ are independent across 
experiments. Consequently, it follows that the loglikelihood function has the form 

(4.61)    $\ell = \sum_{t=1}^{15}\left(\sum_{h\in A}\sum_{i}\sum_{j} Y_{ij}^h(t)\log P_{ijt}(C_A) + \sum_{h\in B}\sum_{i}\sum_{j} Y_{ij}^h(t)\log P_{ijt}(C_B)\right)$.
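
The ranking probability (4.60) and a respondent's contribution to the loglikelihood (4.61) can be sketched as below. The structural utilities $Z_j(t)\beta + \mu_j$ are passed in directly as numbers; their values, the rankings and the function names are hypothetical and only serve to illustrate the formulas.

```python
import math

def rank_prob(i, j, util, choice_set):
    """Probability of ranking i first and j second, as in (4.60).

    util maps each alternative to its structural utility Z_j(t) beta + mu_j.
    """
    top = math.exp(util[i]) / sum(math.exp(util[r]) for r in choice_set)
    second = math.exp(util[j]) / sum(math.exp(util[r]) for r in choice_set if r != i)
    return top * second

def respondent_loglik(rankings, utils_by_experiment, choice_set):
    """Sum over experiments of log P for the observed (first, second) pairs, cf. (4.61)."""
    return sum(math.log(rank_prob(i, j, util, choice_set))
               for (i, j), util in zip(rankings, utils_by_experiment))

if __name__ == "__main__":
    choice_set_a = [1, 2, 4]                     # electric, lpg, gasoline
    utils = [{1: 0.3, 2: 0.1, 4: 0.0},           # hypothetical utilities, experiment 1
             {1: -0.2, 2: 0.4, 4: 0.0}]          # hypothetical utilities, experiment 2
    rankings = [(1, 4), (2, 1)]                  # hypothetical observed rankings
    print(respondent_loglik(rankings, utils, choice_set_a))
```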
The sample is further split into six age and gender groups, and Table 4.1 displays the estimation 


results for these groups. 


Table 4.1. Parameter estimates*) for the age/gender specific utility function 


Age 
— 18-29 30-49 50- 
Attribute | Females Males | Females Males | Females 


Purchase price (in 100 000 NOK) -1.550 
(-11.9) 

Top speed (100 km/h) -0.320 
(-1.1) 

Driving range (1 000 km) 0.140 
(0.2) 

Fuel consumption (liter per 10 km) -0.446 
(-1.5) 

Dummy, electric 0.765 
(3.6) 

Dummy, hybrid 1.216 
(7.7) 

Dummy, lpg 0.698 
(5.7) 

# of observations 1290 
# of respondents 86 
log-likelihood 2040.9 
McFadden’s ρ² 0.12 





*) t-values in parentheses. 


Males 


-1.394 
(-11.8) 


-0.339 
(-1.2) 


1.000 
(1.8) 


-1.030 
(-3.7) 


-0.195 
(-1.0) 


0.666 
(4.6) 


0.676 
(5.6) 


1455 
96 
2333.8 
0.10 


Table 4.1 displays the estimates when the model parameters differ by gender and age. We 
notice that the price parameter is very sharply determined and it is slightly declining by age in 
absolute value. Most of the other parameters also decline by age in absolute value. However, when we 
take the standard error into account this tendency seems rather weak. Further, the utility function does 
not differ much by gender, apart from the parameters associated with fuel consumption and the 
dummies for alternative-fuel cars. Specifically, males seem to be more sceptical towards alternative-fuel 
cars than females. 
To check how well the model performs, we have computed McFadden’s ρ² and in addition we 
have applied the model to predict the individuals’ rankings. The prediction results are displayed in 
Tables 4.2 and 4.3, while McFadden’s ρ² is reported in Table 4.1. We see that McFadden’s ρ² has the 
highest values for young females, and for males with age between 30-49 years. 


Table 4.2. Prediction performance of the model for group A. Per cent 


Gaso- Gaso- Gaso- 
Gender Electric Lpg line | Electric Lpg line Electric Lpg line 






Females: 

Observed 22.3 46.5 25.6 27.4 46.9 
Predicted 32.8 38.5 21.6 25.3 53.2 
Males: 

Observed 20.3 43.5 39.7 22.0 38.3 
Predicted 32.1 35.5 35.3 20.3 44.3 





Table 4.3. Prediction performance of the model for group B. Per cent 


| Gaso- Gaso- Gaso- 
Gender Hybrid Lpg line Hybrid Lpg line Hybrid Lpg line 





Females: 

Observed 42.0 13.0 44.9 22.1 22.0 13.1 64.9 
Predicted 40.3 16.7 37.8 25.3 20.1 21.9 58.0 
Males: 

Observed 46.2 15.7 41.0 26.2 29.0 12.8 58.1 
Predicted 45.2 19.5 35.0 27.6 27.3 19.8 52.9 


The results in Table 4.3 show that for those individuals who receive choice sets that include 
the hybrid vehicle alternative (group B) the model fits the data reasonably well. For the other half of 
the sample for which the electric vehicle alternative is feasible (group A), Table 4.2 shows that the 
predictions fail by about 10 percentage points in four cases. Thus the model performs better for group B 
than for group A. 


4.9. Oligopoly with product differentiation 
This example is taken from Anderson et al. (1994). Consider n firms which each produces a variant of 
a differentiated product. The firms’ decision problem is to determine optimal prices of the different 
variants. 

Assume that firm j produces at fixed marginal costs $c_j$ and has fixed costs $K_j$. There are N 
consumers in the economy and consumer i has utility 

(4.62)    $U_{ij} = y_i + a_j - w_j + \sigma\varepsilon_{ij}$

for variant j, where $y_i$ is the consumer's income, $a_j$ is an index that captures the mean value of 
non-pecuniary attributes (quality) of variant j, $w_j$ is the price of variant j, $\varepsilon_{ij}$ is an individual-specific 
random taste-shifter that captures unobservable product attributes as well as unobservable 
individual-specific characteristics and $\sigma > 0$ is a parameter (unknown). If we assume that $\varepsilon_{ij}$, j=1,2,...,n, 
i=1,2,...,N, are i.i. standard extreme value distributed we get that the aggregate demand for variant j 
equals $NP_j$ where 

(4.63)    $P_j = Q_j(w) = \frac{\exp\left((a_j - w_j)/\sigma\right)}{\sum_{k=1}^{n}\exp\left((a_k - w_k)/\sigma\right)}$.

Assume next that the firm knows the mean fractional demands $\{Q_j(w)\}$ as a function of prices, w. 
Consequently, a firm that produces variant j can calculate expected profit, $\pi_j$, conditional on the 
prices; 

(4.64)    $\pi_j = (w_j - c_j)NQ_j(w) - K_j$.

Now firm j takes the prices set by other firms as given and chooses the price of variant j that 
maximizes (4.64). Anderson et al. (1994) demonstrate that there exists a unique Nash equilibrium set 
of prices, $w^* = (w_1^*, w_2^*, \ldots, w_n^*)$, which are determined by 

(4.65)    $w_j^* = c_j + \frac{\sigma}{1 - Q_j(w^*)}$.

Thus, when estimating the model (4.63) one should take into account the additional restrictions 
determined by (4.65). 
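
The equilibrium condition (4.65) has no closed-form solution in general, but it can be solved numerically. The sketch below is one possible approach, not the method of Anderson et al.: each firm's first-order condition $w_j - c_j - \sigma/(1 - Q_j(w)) = 0$ is solved by bisection given the other firms' prices, and these best responses are iterated until they settle. The search bracket, the iteration counts and the numerical values are assumptions for illustration.

```python
import math

def shares(w, a, sigma):
    """Logit demand fractions Q_j(w) as in (4.63)."""
    z = [(a_j - w_j) / sigma for a_j, w_j in zip(a, w)]
    z_max = max(z)
    e = [math.exp(x - z_max) for x in z]   # subtract the max for numerical stability
    total = sum(e)
    return [x / total for x in e]

def best_response(j, w, a, c, sigma, iters=80):
    """Price of firm j that solves its first-order condition, given the other prices.

    Bisection on [c_j, c_j + 100*sigma]; the upper bound is an assumed bracket.
    """
    lo, hi = c[j], c[j] + 100.0 * sigma
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        trial = list(w)
        trial[j] = mid
        q = shares(trial, a, sigma)[j]
        if mid - c[j] - sigma / (1.0 - q) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def nash_prices(a, c, sigma, rounds=200):
    """Iterate simultaneous best responses, starting from prices equal to marginal cost."""
    w = list(c)
    for _ in range(rounds):
        w = [best_response(j, w, a, c, sigma) for j in range(len(c))]
    return w

if __name__ == "__main__":
    a = [1.0, 0.5, 0.0]     # hypothetical quality indices
    c = [1.0, 0.8, 0.6]     # hypothetical marginal costs
    sigma = 0.5
    w_star = nash_prices(a, c, sigma)
    q_star = shares(w_star, a, sigma)
    print("prices:", w_star)
    print("residuals of (4.65):",
          [w_j - c_j - sigma / (1.0 - q_j) for w_j, c_j, q_j in zip(w_star, c, q_star)])
```

In the symmetric case with identical qualities and costs, (4.65) gives $Q_j = 1/n$ and hence $w_j^* = c + \sigma n/(n-1)$, which is a convenient check on the routine.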


4.10. Social network 

This example is borrowed from Dagsvik (1985). In the time-use survey conducted by Statistics 
Norway, 1980-1981, the survey respondents were asked who they would turn to if they needed help. 
The respondents were divided into two age groups, where group (i) and (ii) consist of individuals less 
than 45 years of age and more than 45 years of age, respectively. Here, we shall only analyze the 
subsample of individuals less than 45 years of age. The universe of alternatives S consisted of five 
alternatives, namely 

    S = {Mother (1), father (2), brother (3), sister (4), neighbor (5)}. 

However, the set of feasible alternatives (choice set) was smaller for many of the respondents. 
Specifically, there turned out to be 11 different choice sets in the sample; $B_1, B_2, \ldots, B_{11}$. The data for 
each of the 11 groups are given in Table 4.5. Group (i) consists of 526 individuals. 

The question is whether the above data can be rationalized by a choice model. To this end we 
first estimated a logit model 

(4.66)    $P_j(B_k) = \frac{e^{v_j}}{\sum_{r\in B_k} e^{v_r}}$

where k=1,2,...,11, and $v_5 = 0$. Thus this model contains four parameters to be estimated. Let $\hat{P}_j(B_k)$ 
be the observed choice frequencies conditional on choice set $B_k$. Let $\ell^*$ denote the loglikelihood 
obtained when the respective choice probabilities are estimated by $\hat{P}_j(B_k)$, $j \in B_k$. From Table 4.5 it 
follows that $\ell^* = -405.8$. In the logit model there are four free parameters, while there are 24 “free” 
probabilities in the 11 multinomial models in the a priori statistical model. Consequently, if $\ell_1$ denotes 
the loglikelihood based on the logit model it follows that $-2(\ell_1 - \ell^*)$ is (asymptotically) Chi squared 
distributed with 20 degrees of freedom. Since the corresponding critical value at 5 per cent 
significance level equals 31.4 it follows from estimation results reported in Table 4.4 that the logit 
model is rejected against the non-structural multinomial model. One interesting hypothesis that might 
explain this rejection is that alternative five ("neighbor") differs from the "family" alternatives in the 
sense that the family alternatives depend on a latent variable which represents the "family aspect", 
that makes the family alternatives more "close" than non-family alternatives. As a consequence, the 
family alternatives will have correlated utilities. To allow for this effect we postulate a nested logit 
structure with utilities that are correlated for the family alternatives. Specifically, we assume that 

    $\mathrm{corr}(U_i, U_j) = 1 - \theta^2$

for $i \ne j$, $i,j \ne 5$, and 

    $\mathrm{corr}(U_i, U_5) = 0$,

for $i < 5$, where $0 < \theta \le 1$. This yields 

(4.67)    $P_j(B) = \frac{e^{v_j/\theta}}{\sum_{k\in B} e^{v_k/\theta}}$

when $5 \notin B$, 

(4.68)    $P_j(B) = \frac{\left(\sum_{k\in B\setminus\{5\}} e^{v_k/\theta}\right)^{\theta-1} e^{v_j/\theta}}{e^{v_5} + \left(\sum_{k\in B\setminus\{5\}} e^{v_k/\theta}\right)^{\theta}}$

when $j \ne 5$, $5 \in B$, and 

(4.69)    $P_5(B) = \frac{e^{v_5}}{e^{v_5} + \left(\sum_{k\in B\setminus\{5\}} e^{v_k/\theta}\right)^{\theta}}$.

As above we set $v_5 = 0$. 
The parameter estimates in the nested logit case are also given in Table 4.4. We notice that 
while only $v_1$ and $v_4$ are precisely determined in the logit case, all the parameters are rather precisely 
determined in the nested logit case. The estimate of $\theta$ implies that the correlation between the utilities 
of the family alternatives equals 0.79. 
From Table 4.4 we find that twice the difference in loglikelihood between the two models 
equals 17.6. Since the critical value of the Chi squared distribution with one degree of freedom at 5 
per cent level equals 3.8, it follows that the logit model is rejected against the nested logit alternative. 
As above we can also compare the nested logit model to the non-structural multinomial 
model. Let $\ell_2$ denote the loglikelihood of the nested logit model. Since the nested logit model has 
five parameters it follows that $-2(\ell_2 - \ell^*)$ is (asymptotically) Chi squared distributed with 19 
degrees of freedom. The corresponding critical value is 30.1 at 5 per cent significance level and 
therefore the estimate of $-2(\ell_2 - \ell^*)$ in Table 4.4 implies that the nested logit model is not rejected 
against the non-structural multinomial model. Thus, in terms of goodness-of-fit there seems to be an 
essential difference between the logit and the nested logit formulation. However, as measured by 
McFadden's ρ², the difference in goodness-of-fit is only one per cent! This shows that one should be 
very cautious when interpreting ρ². 
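
The quantities entering the two tests against the non-structural multinomial model can be recomputed from the observed counts in Table 4.5. The sketch below evaluates the loglikelihood of the saturated model (observed frequencies used as probabilities), counts the free probabilities and derives the degrees of freedom. The counts are transcribed from Table 4.5; the function names are assumptions.

```python
import math

# Observed counts from Table 4.5 for choice sets B1,...,B11
# (mother, father, brother, sister, neighbor); None marks "not feasible".
OBSERVED = [
    [30, None, None, None, 6],
    [None, None, 36, None, 20],
    [21, None, 2, None, 1],
    [None, None, 9, 21, 2],
    [None, 5, None, None, 2],
    [65, 3, None, None, 10],
    [50, 4, 4, None, 6],
    [23, None, None, 7, 8],
    [45, 2, None, 5, 8],
    [21, None, 2, 6, 8],
    [64, 4, 5, 15, 6],
]

def saturated_loglik(rows):
    """Loglikelihood when each probability is estimated by the observed frequency."""
    total = 0.0
    for row in rows:
        counts = [x for x in row if x is not None]
        n = sum(counts)
        total += sum(x * math.log(x / n) for x in counts if x > 0)
    return total

def free_probabilities(rows):
    """Number of free probabilities: (feasible alternatives - 1) summed over choice sets."""
    return sum(sum(x is not None for x in row) - 1 for row in rows)

if __name__ == "__main__":
    print("sample size:", sum(x for row in OBSERVED for x in row if x is not None))
    print("saturated loglikelihood:", saturated_loglik(OBSERVED))
    print("free probabilities:", free_probabilities(OBSERVED))
    print("df, logit:", free_probabilities(OBSERVED) - 4)
    print("df, nested logit:", free_probabilities(OBSERVED) - 5)
```

The computed loglikelihood should come out close to the value of $\ell^*$ quoted in the text, and the degrees of freedom equal 20 and 19.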
Table 4.4. Parameter estimates 


Logit model Nested logit model 


Vi 31.8 
V> 5.5 
V3 8.3 
V4 16.8 
0 15.0 


loglikelihood £, 
McFadden’s p? 


-2(¢,-£") 





In Table 4.5 we report the data and the prediction performance of the two model versions. The 


table shows that the nested logit model predicts the fractions of observed choices rather well. 
Table 4.5. Prediction performance of the logit and the nested logit models 

                                              Alternatives
Choice                           1        2        3        4        5          # obser-
sets                             Mother   Father   Brother  Sister   Neighbor   vations

B1    Observed                   30       NF       NF       NF       6          36
      Predicted, logit           32.1     NF       NF       NF       3.9
      Predicted, nested logit    31.4     NF       NF       NF       4.6

B2    Observed                   NF       NF       36       NF       20         56
      Predicted, logit           NF       NF       29.4     NF       26.6
      Predicted, nested logit    NF       NF       38.6     NF       17.3

B3    Observed                   21       NF       2        NF       1          24
      Predicted, logit           19.2     NF       2.5      NF       2.3
      Predicted, nested logit    19.4     NF       1.5      NF       2.9

B4    Observed                   NF       NF       9        21       2          32
      Predicted, logit           NF       NF       8.5      15.8     7.7
      Predicted, nested logit    NF       NF       7.0      18.6     6.4

B5    Observed                   NF       5        NF       NF       2          7
      Predicted, logit           NF       2.6      NF       NF       4.4
      Predicted, nested logit    NF       4.6      NF       NF       2.4

B6    Observed                   65       3        NF       NF       10         78
      Predicted, logit           65.4     4.7      NF       NF       7.9
      Predicted, nested logit    64.5     3.9      NF       NF       9.6

B7    Observed                   50       4        4        NF       6          64
      Predicted, logit           48.3     3.5      6.4      NF       5.8
      Predicted, nested logit    49.2     3.0      4.1      NF       7.1

B8    Observed                   23       NF       NF       7        8          38
      Predicted, logit           27.8     NF       NF       6.9      3.3
      Predicted, nested logit    27.5     NF       NF       6.0      4.4

B9    Observed                   45       2        NF       5        8          60
      Predicted, logit           41.7     3.0      NF       10.3     5.0
      Predicted, nested logit    41.5     2.5      NF       9.1      6.8

B10   Observed                   21       NF       2        6        8          37
      Predicted, logit           24.7     NF       3.3      6.1      3.0
      Predicted, nested logit    25.2     NF       2.1      5.5      4.2

B11   Observed                   64       4        5        15       6          94
      Predicted, logit           60.0     4.3      7.9      14.8     7.2
      Predicted, nested logit    61.3     3.7      5.1      13.4     10.5

NF = Not feasible. 


5. Discrete/continuous choice 


5.1. The nonstructural Tobit model 

In this section we shall describe a type of statistical model, usually called the Tobit model. The Tobit 
model (Tobin, 1958) is motivated from the latent variable specification (2.7), in section 2.1.1, but in 
contrast to the case described there we now also observe the left hand side variable when it is positive. 
Thus we observe Y defined by 

(5.1)    $Y = \begin{cases} X\beta + u\sigma & \text{if } X\beta + u\sigma > 0, \\ 0 & \text{otherwise,} \end{cases}$

where $\sigma > 0$ is a scale parameter, and u is a zero mean random variable with cumulative distribution 
function F(·). Another way of expressing (5.1) is as 

(5.2)    $Y = \max(0, X\beta + u\sigma)$.

Tobin (1958) assumed that u is normally distributed N(0,1), but it is also convenient to work with the 
logistic distribution. 
An example of a Tobit formulation is the standard labor supply model. Here we may interpret 
$X\beta c + u\sigma c$ as an index that measures the desire to work of an agent with characteristics X. When this 
index is positive, the desired hours of work is typically assumed proportional to $X\beta c + u\sigma c$ where 1/c 
is the proportionality factor. The variable vector X may contain education and work experience, and the 
unobservable term u may capture the effect of unobservable variables such as specific skills and 
training. When the index $X\beta c + u\sigma c$ is negative and large, say, it means that the agent has strong 
preference for leisure. Since the actual hours of work always will be non-negative we therefore get 
the structure (5.1). 
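
As an illustration of (5.1) and (5.2) under Tobin's assumption that u is N(0,1), the sketch below simulates censored observations and evaluates the loglikelihood contribution of a single observation: the normal density for positive outcomes and the probability $\Phi(-X\beta/\sigma)$ for outcomes equal to zero. The parameter values and function names are hypothetical.

```python
import math
import random

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def simulate_tobit(xb, sigma, draws, seed=0):
    """Draw Y = max(0, X beta + u sigma) with u standard normal, cf. (5.2)."""
    rng = random.Random(seed)
    return [max(0.0, xb + sigma * rng.gauss(0.0, 1.0)) for _ in range(draws)]

def tobit_loglik(y, xb, sigma):
    """Loglikelihood contribution of one observation under (5.1) with normal u."""
    if y > 0.0:
        return math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma)
    return math.log(norm_cdf(-xb / sigma))

if __name__ == "__main__":
    xb, sigma = 0.5, 2.0   # hypothetical index X beta and scale
    sample = simulate_tobit(xb, sigma, draws=50000)
    share_zero = sum(1 for y in sample if y == 0.0) / len(sample)
    print("simulated share at zero:", share_zero)
    print("P(Y = 0) from the model:", norm_cdf(-xb / sigma))
    print("sample loglikelihood:", sum(tobit_loglik(y, xb, sigma) for y in sample))
```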


5.2. The general structural setting 

Models such as the Tobit model account for some of the statistical nature of the data, but are not 
structural in a "deep" sense. We shall now discuss structural specifications derived from choice 
theory. In many situations a decision-maker makes interrelated choices where one choice is discrete 
and the other is continuous. For example, a worker may face the decision problem of which job to 
choose and how many hours to work (conditional on the choice of job). Another example is a 
consumer that considers purchasing electric versus gas appliances, as well as how much electricity or 
gas to consume. A third example is a household that chooses which type of car to own and the 
intensity of car use. 
Such choice situations are called discrete/continuous, reflecting the fact that the choice set 
along one dimension is discrete while it is continuous along another dimension. Theories and methods 
for specifying and estimating structural models for discrete/continuous choice have been developed 
among others by Heckman (1974, 1979), Dubin and McFadden (1984), Lee and Trost (1978), King 
(1980) and Dagsvik (1994). 

We now consider an agent that faces two choices: first, which alternative to choose from a 
finite and exhaustive set of mutually exclusive alternatives, and second, how much of a particular 
good to consume. Since it is often the case that these choices depend on the same underlying factors 
this should be taken into account in the formulation of the model and in the corresponding 
econometric specification. Suppose for expository simplicity that there are only two continuous 
goods. Let $U_j(x_1, x_2)$ be the utility of alternative $(j, x_1, x_2)$, where j=1,2,..., indexes the discrete 
alternatives and $(x_1, x_2)$ the continuous ones. Thus the agent’s optimization problem is to maximize 
$U_j(x_1, x_2)$ with respect to $(j, x_1, x_2)$ subject to the budget constraints $j \in B$ and 

(5.3)    $x_1 p_1 + x_2 p_2 + \sum_k \delta_k c_k = y$,  $x_1 \ge 0$,  $x_2 \ge 0$,

where B is the choice set of feasible (discrete) alternatives, $p_1, p_2$ are prices, y is the agent’s income 
(exogenous), $c_j$ is the cost (or annual user cost) of the discrete alternative j and $\delta_k = 1$ if alternative 
$k \in B$ is chosen and zero otherwise. Consider now the continuous choice given the discrete alternative 
j. Let 

(5.4)    $V_j(p, y - c_j) = \max_{x_1 p_1 + x_2 p_2 = y - c_j,\; x_1 \ge 0,\; x_2 \ge 0} U_j(x_1, x_2)$,

which means that $V_j(p, y - c_j)$ is the conditional indirect utility, given that the discrete alternative j is 
chosen. Since $V_j(p, y - c_j)$ expresses the highest possible utility conditional on alternative j, it must 
be the case that alternative j is chosen if 

(5.5)    $V_j(p, y - c_j) = \max_{k\in B} V_k(p, y - c_k)$.

Second, it follows from Roy’s identity that under standard regularity conditions we obtain the 
corresponding continuous demands by 

(5.6)    $x_r = -\frac{\partial V_j(p, y - c_j)/\partial p_r}{\partial V_j(p, y - c_j)/\partial y}$

for r=1,2, given that j is the preferred discrete alternative, i.e., given that (5.5) holds. Thus the 
discrete as well as the continuous choices are here derived from a common representation of the 
preferences. 

It is known from duality theory that under standard regularity conditions the specification of 
the indirect utility is equivalent to the specification of the corresponding direct utility. Therefore, in 
econometric model building, it is convenient to start with a parametric functional form of the indirect 
utility function, including alternative-specific random terms. 


5.3. The Gorman Polar functional form 
When the conditional indirect utility function belongs to the class of functional forms called "Gorman
Polar forms" (Gorman, 1953), the structure of the demand equations and choice probabilities
becomes particularly convenient. The Gorman Polar functional form is given by

(5.7)    $V_j = V_j(p, y-c_j) = \dfrac{y - c_j + a(p)(\varepsilon_j + m_j)}{b(p)}$

where $a(\cdot)$ and $b(\cdot)$ are functions that are homogeneous of degree one, concave and non-decreasing in
p, and $\{m_j\}$ are alternative-specific terms which are independent of prices and income. It then follows
that $V_j$ is non-increasing and convex in prices. Here $\{\varepsilon_j\}$ are random terms that are supposed to
account for unobservables that affect preferences and $m_j$ is (possibly) a function of observable
attributes associated with alternative j.


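From (5.7) it follows that the choice probabilities are given by

(5.8)    $P_j(B) = P\big(m_j - c_j/a(p) + \varepsilon_j = \max_{k \in B} (m_k - c_k/a(p) + \varepsilon_k)\big).$

In case $\{\varepsilon_j\}$ are i.i.d. extreme value distributed we obtain

(5.9)    $P_j(B) = \dfrac{\exp(m_j - c_j/a(p))}{\sum_{k \in B} \exp(m_k - c_k/a(p))}.$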
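
As a minimal numerical sketch of (5.9), the choice probabilities can be computed as a logit over the terms $m_j - c_j/a(p)$. The function name and the input values below are hypothetical and chosen only for illustration; a(p) is assumed to have been evaluated at the current prices beforehand.

```python
import math

def gorman_choice_probs(m, c, a_p):
    """Logit probabilities of (5.9): P_j is proportional to exp(m_j - c_j / a(p))."""
    v = [m_j - c_j / a_p for m_j, c_j in zip(m, c)]
    v_max = max(v)  # subtract the maximum before exponentiating, for numerical stability
    w = [math.exp(x - v_max) for x in v]
    total = sum(w)
    return [x / total for x in w]

# Hypothetical inputs: three alternatives, a(p) = 20 at the current prices.
probs = gorman_choice_probs(m=[0.5, 0.2, 0.0], c=[10.0, 4.0, 0.0], a_p=20.0)
```

In this illustration the higher attribute terms are exactly offset by the higher costs, so the three alternatives are equally likely; raising the cost of one alternative lowers its probability.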


By Roy's identity we obtain the demands as

(5.10)   $x_{rj} = \Big(\dfrac{a(p) b_r(p)}{b(p)} - a_r(p)\Big) m_j + \dfrac{b_r(p)}{b(p)} (y - c_j) + \Big(\dfrac{a(p) b_r(p)}{b(p)} - a_r(p)\Big) \varepsilon_j,$

where $a_r(p)$ and $b_r(p)$ denote the respective partial derivatives with respect to component r.
Recall, however, that due to the selectivity problem we cannot automatically apply standard
methods to estimate (5.10), as we shall discuss in further detail below.


Example 5.1 


Assume that the conditional indirect utility function has the form 


(5.11) V;(p.y -c,)= log|(Z,0.+B,.p, +B oP. +0(y -c;)+e,)e™' |-B, logp, 


where $\{\varepsilon_j\}$ are i.i.d. standard extreme value distributed random terms which have mean 0.5772, and $\alpha$,
$\beta_{jr}$, r=1,2, $\theta$ and $\mu$ are unknown parameters.⁵ The specification (5.11) has been applied by Dubin
and McFadden (1984). However, (5.11) is not a Gorman Polar functional form. First, we obtain


dV;(p.y-¢;] 
j j one 
(5.12) ae se iis -nV;(p,y —c,) 
and 

0V.(p,y—c 
(5.13) (Py De up, 

dy 

Consequently, by (5.6) 
5.14 X,, =(Z Ə Pi 
(5.14) Ta j¢+B Pp) +B Pp. + (ve) +e. 


Second, note that maximization of Vj (p. y-c ;) in (5.11) with respect to j is equivalent to 


maximizing 


since exp(—18p, ) does not depend on j. Therefore, the probability of choosing alternative j equals 


⁵ Note that (5.11) is not homogeneous of degree zero in prices and income. We may, however, interpret (5.11) as
an indirect utility function in normalized prices and income. This is possible because a function v(p,y) of
normalized prices and income is the indirect utility function of some locally nonsatiated utility function if and
only if it is lower semicontinuous, quasi-convex, increasing in y, nonincreasing in p, and has $v(\lambda p, \lambda y)$
nondecreasing in $\lambda$.
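(5.15)   $P_j = P\big(Z_j\alpha + \beta_{j1}p_1 + \beta_{j2}p_2 + \theta(y - c_j) + \varepsilon_j = \max_k (Z_k\alpha + \beta_{k1}p_1 + \beta_{k2}p_2 + \theta(y - c_k) + \varepsilon_k)\big)$

         $= \dfrac{\exp(Z_j\alpha + \beta_{j1}p_1 + \beta_{j2}p_2 - \theta c_j)}{\sum_k \exp(Z_k\alpha + \beta_{k1}p_1 + \beta_{k2}p_2 - \theta c_k)}.$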


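Recall that while the unconditional mean of $\varepsilon_j$ is 0.5772, the conditional mean of $\varepsilon_j$ given that j is the
chosen alternative is not equal to 0.5772. By Lemma A2 in Appendix A we have that when $\{\varepsilon_k\}$ are
extreme value distributed then

(5.16)   $E\big((v_j + \varepsilon_j)\mu \mid v_j + \varepsilon_j = \max_k (v_k + \varepsilon_k)\big) = \mu E \max_k (v_k + \varepsilon_k).$

Since by Lemma 1 in Appendix A

         $\mu \max_k (v_k + \varepsilon_k) = \mu \log\big(\sum_k e^{v_k}\big) + \varepsilon\mu,$

where $\varepsilon$ has the same distribution as $\varepsilon_1$, it follows that

(5.17)   $\mu E\big(\varepsilon_j \mid v_j + \varepsilon_j = \max_k (v_k + \varepsilon_k)\big) = \mu E\big(v_j + \varepsilon_j \mid v_j + \varepsilon_j = \max_k (v_k + \varepsilon_k)\big) - v_j\mu$

         $= \mu E\big(\max_k (v_k + \varepsilon_k)\big) - v_j\mu = \mu \log\big(\sum_k e^{v_k}\big) - v_j\mu + E\varepsilon\,\mu$

         $= \mu \log\big(\sum_k e^{v_k}\big) - v_j\mu + 0.5772\mu.$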


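From this result it follows that

(5.18)   $E\big(x_{1j} \mid V_j(p, y-c_j) = \max_k V_k(p, y-c_k)\big)$

         $= (Z_j\alpha + \beta_{j1}p_1 + \beta_{j2}p_2 + \theta(y - c_j))\mu - \dfrac{\beta_{j1}}{\theta} + 0.5772\mu$

         $\quad - (Z_j\alpha + \beta_{j1}p_1 + \beta_{j2}p_2 - \theta c_j)\mu + \mu \log\big(\sum_k \exp(Z_k\alpha + \beta_{k1}p_1 + \beta_{k2}p_2 - \theta c_k)\big)$

         $= 0.5772\mu - \dfrac{\beta_{j1}}{\theta} + \theta\mu y + \mu \log\big(\sum_k \exp(Z_k\alpha + \beta_{k1}p_1 + \beta_{k2}p_2 - \theta c_k)\big).$

The interpretation of (5.18) is as the mean demand of good one given that j is the preferred
discrete alternative.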
The result in (5.18) implies that a regression analysis based directly on (5.14)
will produce biased estimates. Instead one should apply the specification

(5.19)   $x_{1j} = \beta_j^* + \theta\mu y + \mu \log\big(\sum_k \exp(Z_k\alpha + \beta_{k1}p_1 + \beta_{k2}p_2 - \theta c_k)\big) + \eta_j$

where $\beta_j^* = 0.5772\mu - \beta_{j1}/\theta$ and $\eta_j$ is a random error term with the property that the mean of $\eta_j$ given
that j is the chosen alternative equals zero. The estimation can be carried out in two steps: First
estimate $\alpha$, $\beta_{k1}$, $\beta_{k2}$ and $\theta$ by the maximum likelihood procedure. Second, apply these estimates to
compute

         $\log\big(\sum_k \exp(Z_k\hat\alpha + \hat\beta_{k1}p_1 + \hat\beta_{k2}p_2 - \hat\theta c_k)\big)$

which, analogous to Heckman's two stage procedure, is used as a known regressor in (5.19), and the
remaining parameters $\theta$, $\beta_j^*$ and $\mu$ can be estimated by OLS in a second stage.
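
The selectivity correction behind (5.17) can be checked by simulation. The sketch below treats the case $\mu = 1$ and compares the closed-form conditional mean of $\varepsilon_j$, given that j is chosen, with a Monte Carlo estimate. The function names, the systematic terms v and the sample size are illustrative assumptions, not part of the original text.

```python
import math
import random

EULER = 0.5772156649  # mean of the standard extreme value distribution

def gumbel(rng):
    """One draw from the standard extreme value (Gumbel) distribution."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def selection_term(v, j):
    """Closed form E(eps_j | j chosen) = 0.5772 + log(sum_k exp(v_k)) - v_j (mu = 1)."""
    return EULER + math.log(sum(math.exp(x) for x in v)) - v[j]

def simulate_conditional_mean(v, j, n=200_000, seed=0):
    """Monte Carlo estimate of E(eps_j | v_j + eps_j is the largest utility)."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        eps = [gumbel(rng) for _ in v]
        u = [v_k + e_k for v_k, e_k in zip(v, eps)]
        if max(range(len(v)), key=u.__getitem__) == j:
            total += eps[j]
            count += 1
    return total / count

v = [1.0, 0.5, 0.0]  # hypothetical systematic utilities
closed_form = selection_term(v, 0)
```

The closed-form value exceeds 0.5772 for every alternative, which is why a regression that ignores the conditioning on the chosen alternative is biased.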


Example 5.2 


Assume that the conditional indirect utility has the Gorman Polar form with

(5.20)   $a(p) = a_0 \prod_k p_k^{\alpha_k}$

and

(5.21)   $b(p) = b_0 \prod_k p_k^{\beta_k}$

where $a_0$, $b_0$, $\alpha_k$, $\beta_k$ are positive and

         $\sum_k \alpha_k = \sum_k \beta_k = 1.$

From (5.10), (5.20) and (5.21) it follows that

(5.22)   $x_{rj} p_r = a(p)(\beta_r - \alpha_r) m_j + (y - c_j)\beta_r + a(p)(\beta_r - \alpha_r)\varepsilon_j.$

If $\{\varepsilon_j\}$ are standard extreme value distributed the discrete choice probabilities are as in (5.9) with
(5.20) inserted. If, for example, $m_j = Z_j\gamma + \delta$, where $Z_j$ is an observable attribute vector and $\gamma$ and $\delta$
are parameters, and if $\{Z_j\}$, $\{c_j\}$ and the prices vary sufficiently across a randomly selected sample of
agents, then it is possible to estimate $\gamma$, $\{\alpha_k\}$ and $a_0$ from observations on the agents' discrete choices.
The remaining parameters to be estimated are $\{\beta_k\}$ and $\delta$. These parameters can be estimated in a
second stage by applying (5.22) and controlling for the selectivity bias as explained in Example 5.1.
5.4. Perfect substitute models
We now consider choice problems in which there are m+1 goods of which m brands are perfect
substitutes, cf. Hanemann (1984). The utility function has the structure

(5.23)   $U\big(\sum_{k=1}^{m} \psi_k x_k,\; z\big)$

and the budget constraint is

(5.24)   $\sum_{k=1}^{m} p_k x_k + z = y.$

Here, $\{\psi_k\}$ are unknown parameters and U is a conventional utility function. Letting $\psi_k x_k = z_k$, the
corresponding utility maximization problem can be written as

(5.25)   $\max U\big(\sum_{k=1}^{m} z_k,\; z\big)$

subject to

(5.26)   $\sum_{k=1}^{m} \dfrac{p_k}{\psi_k} z_k + z = y, \quad x_k \ge 0.$

Clearly, this maximization problem implies a "corner" solution where the consumer selects the brand
with the lowest "price", $p_k/\psi_k$. Thus, brand j is chosen if

(5.27)   $\dfrac{p_j}{\psi_j} = \min_k \Big(\dfrac{p_k}{\psi_k}\Big)$

while $x_k = 0$ for $k \ne j$. The corresponding indirect utility equals


(5.28)   $V_j = \max_{z + z_j p_j/\psi_j = y} U(z_j, z) = V\Big(\dfrac{p_j}{\psi_j}, y\Big)$

where V(q,y) is the indirect utility that corresponds to the direct utility $U(z_j, z)$, i.e.,

(5.29)   $V(q,y) = \max_{z + q z_j = y} U(z_j, z).$

Now assume that

(5.30)   $\log \psi_j = Z_j\beta/\mu + \varepsilon_j/\mu$
where $Z_j$ is a vector of non-pecuniary attributes associated with brand j while $\beta$ and $\mu > 0$ are
unknown parameters and $\varepsilon_j$ are i.i.d. standard extreme value distributed. Now from (5.27) and (5.30) we
obtain that brand j is chosen if $U_j = \max_k U_k$, where

(5.31)   $U_j = Z_j\beta - \mu \log p_j + \varepsilon_j,$

and therefore the choice probabilities are given by

(5.32)   $P_j = \dfrac{\exp(Z_j\beta - \mu \log p_j)}{\sum_k \exp(Z_k\beta - \mu \log p_k)}.$

Note that in this case there are no fixed costs associated with the discrete choice. As above the
continuous demands follow by applying Roy's identity.


Example 5.3 (Hanemann, 1984, p. 550) 








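Let

         $V(q,y) = \dfrac{\theta q^{1-\rho}}{\rho - 1} - \dfrac{e^{-\eta y}}{\eta}, \quad \theta > 0,\; \eta \ne 0,$

which yields

(5.33)   $\partial_1 V(q,y) = -\theta q^{-\rho}, \qquad \partial_2 V(q,y) = e^{-\eta y},$

where $\partial_1$ and $\partial_2$ denote the respective partial derivatives, and therefore it follows from (5.27) that the
continuous demand for brand j is given by

(5.34)   $x_j = -\dfrac{1}{\psi_j} \dfrac{\partial_1 V(p_j/\psi_j, y)}{\partial_2 V(p_j/\psi_j, y)} = \theta p_j^{-\rho} \psi_j^{\rho-1} e^{\eta y}.$

From (5.34) and (5.30) we get

(5.35)   $\log(x_j p_j) = \log\theta + (\rho - 1)\log\psi_j + (1 - \rho)\log p_j + \eta y$

         $= \log\theta + \dfrac{\rho - 1}{\mu} Z_j\beta + (1 - \rho)\log p_j + \eta y + \dfrac{\rho - 1}{\mu}\varepsilon_j.$

Hence, it follows that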
         $E\big(\log(x_j p_j) \mid U_j = \max_k U_k\big) = \log\theta + \eta y + \dfrac{\rho - 1}{\mu} E\big(U_j \mid U_j = \max_k U_k\big).$

From Lemma A2 in Appendix A we have that

(5.36)   $E\big(U_j \mid U_j = \max_k U_k\big) = E \max_k U_k = 0.5772 + \log \sum_k \exp(Z_k\beta - \mu \log p_k)$

which implies that

(5.37)   $E\big(\log(x_j p_j) \mid U_j = \max_k U_k\big) = \log\theta + 0.5772\,\dfrac{\rho - 1}{\mu} + \eta y + \dfrac{\rho - 1}{\mu} \log \sum_k \exp(Z_k\beta - \mu \log p_k).$


Similarly, Lemma A2 implies that

(5.38)   $\mathrm{Var}\big(U_j \mid U_j = \max_k U_k\big) = \mathrm{Var}\big(\max_k U_k\big).$

Note that in the conditional expectations and variance above it is implicitly understood that y and
$\{Z_k\}$ are given. Apart from an additive deterministic term, $\max_k U_k$ has the same distribution as $\varepsilon_1$.
Consequently, (5.35) implies that

(5.39)   $\mathrm{Var}\big(\log(x_j p_j) \mid U_j = \max_k U_k\big) = \mathrm{Var}\Big(\dfrac{\rho - 1}{\mu}\varepsilon_1\Big) = \dfrac{(\rho - 1)^2 \pi^2}{6\mu^2}.$


Suppose now that our sample only consists of a simple cross-section. Then, since $\{Z_k\}$ do not vary
across individuals we may write

(5.40)   $\log(x_j p_j) = a + \eta y + \delta_j$

where

(5.41)   $a = \log\theta + 0.5772\,\dfrac{\rho - 1}{\mu} + \dfrac{\rho - 1}{\mu} \log \sum_k \exp(Z_k\beta - \mu \log p_k)$

and $\delta_j$ is a random term which due to (5.37) and (5.39) has the property that

         $E\big(\delta_j \mid U_j = \max_k U_k\big) = 0$

and

         $\mathrm{Var}\big(\delta_j \mid U_j = \max_k U_k\big) = \dfrac{(\rho - 1)^2 \pi^2}{6\mu^2}.$


Thus, the model parameters can be estimated in two stages. 

Stage 1: Estimate B and u from data on the discrete choices by means of model (5.32). 
Stage 2: Estimate a, $\eta$ and $(\rho - 1)/\mu$ on the basis of (5.40). By inserting the estimates of a, $\mu$, $\rho - 1$ and
$\beta$ in (5.41) an estimate of $\theta$ can be obtained.


Similarly to (5.37) it is easy to prove that

(5.42)   $\log E\big(x_j p_j \mid U_j = \max_k U_k\big) = \log\theta + \log\Gamma\Big(1 + \dfrac{1 - \rho}{\mu}\Big) + \eta y + \dfrac{\rho - 1}{\mu} \log \sum_k \exp(Z_k\beta - \mu \log p_k),$

where $\Gamma(\cdot)$ is the Gamma function. Suppose now that microdata are not available but that one has
macro time series data for $\{Z_k\}$, $x_j p_j$, prices, the mean income, y, and the aggregate shares, $\{P_j\}$.
Then it is possible to use (5.32) and (5.42) to estimate the unknown parameters in a two-stage
procedure.
6. Estimation
We shall briefly review maximum likelihood estimation, Berkson’s method and finally Heckman’s two 


stage method. 


6.1. Maximum likelihood 
Suppose the multinomial probability model has been specified, for example as (2.2), (2.4), (2.6), or as
a binary Probit model. Let $Y_{ij} = 1$ if agent i, in a sample of randomly selected agents, falls into
category j and zero otherwise, and let $\{H_j(X_i)\}$ be the corresponding multinomial logit probabilities
given by (2.2), where $X_i$ is the vector of explanatory variables for agent i. The total likelihood of the
observed outcome equals

         $\prod_{i=1}^{N} \prod_{j=1}^{m} H_j(X_i)^{Y_{ij}}$

where N is the sample size. The loglikelihood function can therefore be written as

(6.1)    $\ell = \sum_{i=1}^{N} \sum_{j=1}^{m} Y_{ij} \log H_j(X_i).$

By the maximum likelihood principle the unknown parameters are estimated by maximizing $\ell$ with
respect to the unknown parameters.


The logit structure implies that the first order conditions of the loglikelihood function equal

(6.2)    $\dfrac{\partial \ell}{\partial \beta_{rk}} = \sum_{i=1}^{N} \big(Y_{ir} - H_r(X_i)\big) X_{ik} = 0$

for r=2,3,...,m, k=1,2,...,K, where $X_{ik}$ is the k-th component of $X_i$.

When the logit model has the structure (2.6) then the first order conditions yield

(6.3)    $\dfrac{\partial \ell}{\partial \beta_k} = \sum_{i=1}^{N} \sum_{j=1}^{m} \big(Y_{ij} - H_j(Z, X_i)\big) R_k(Z_j, X_i) = 0$

for k=1,2,...,K.
McFadden (1973) has proved that when the probabilities are given by (2.6), the loglikelihood
function is globally strictly concave, and therefore a unique solution to (6.3) is guaranteed.
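
The loglikelihood (6.1) and first order conditions of the type (6.3) can be sketched in a few lines. Since (2.6) is not reproduced in this section, the sketch assumes, purely for illustration, a conditional logit in which alternative j has utility index $Z_{ij}\beta$ for agent i, so that the role of $R_k(Z_j, X_i)$ is played by the k-th attribute of alternative j. All names and data below are hypothetical.

```python
import math

def logit_probs(beta, Z_i):
    """Choice probabilities for one agent; Z_i[j] is the attribute vector of alternative j."""
    v = [sum(b * z for b, z in zip(beta, z_j)) for z_j in Z_i]
    v_max = max(v)
    w = [math.exp(x - v_max) for x in v]
    total = sum(w)
    return [x / total for x in w]

def loglik(beta, Z, choice):
    """Loglikelihood as in (6.1): sum over agents of log H_j for the chosen alternative j."""
    return sum(math.log(logit_probs(beta, Z_i)[j]) for Z_i, j in zip(Z, choice))

def score(beta, Z, choice):
    """Gradient as in (6.3): sum_i sum_j (Y_ij - H_j) * (k-th attribute of alternative j)."""
    g = [0.0] * len(beta)
    for Z_i, j in zip(Z, choice):
        H = logit_probs(beta, Z_i)
        for alt, z_alt in enumerate(Z_i):
            resid = (1.0 if alt == j else 0.0) - H[alt]
            for k, z in enumerate(z_alt):
                g[k] += resid * z
    return g

# Hypothetical data: two agents, three alternatives, two attributes.
Z = [[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
     [[0.2, 1.0], [1.0, 0.3], [0.0, 0.0]]]
choice = [0, 2]
beta = [0.3, -0.2]
gradient = score(beta, Z, choice)
```

Because the loglikelihood is concave in this case, a gradient-based or Newton-type routine started from any point would be expected to reach the same maximum; the analytic score above can be checked against finite differences of `loglik`.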
6.2. Berkson's method
If we have a case with several observations for each value of the explanatory variable it is possible to
carry out estimation by Berkson's method (Berkson, 1953). Model (2.4) is an example of a case where
this method is applicable, since this model does not depend on individual characteristics. Let

         $\hat H_j = \dfrac{1}{N} \sum_{i=1}^{N} Y_{ij}$

and replace $H_j$ by $\hat H_j$ in (2.5). We then obtain

(6.4)    $\log\Big(\dfrac{\hat H_j}{\hat H_1}\Big) = (Z_j - Z_1)\beta + \eta_j,$


where $\eta_j$ is a random error term. Since, by the strong law of large numbers, $\hat H_j \to H_j$ with probability one as
the sample size increases, the error term $\eta_j$ will be small when N is "large". Also, by a first order Taylor
approximation we get

         $\log\Big(\dfrac{\hat H_j}{\hat H_1}\Big) = \log \hat H_j - \log \hat H_1 \approx \log\Big(\dfrac{H_j}{H_1}\Big) + \dfrac{\hat H_j - H_j}{H_j} - \dfrac{\hat H_1 - H_1}{H_1}$

which shows that

(6.5)    $E\eta_j = E\log\Big(\dfrac{\hat H_j}{\hat H_1}\Big) - (Z_j - Z_1)\beta \approx \log\Big(\dfrac{H_j}{H_1}\Big) + \dfrac{E\hat H_j - H_j}{H_j} - \dfrac{E\hat H_1 - H_1}{H_1} - (Z_j - Z_1)\beta$

         $= \log\Big(\dfrac{H_j}{H_1}\Big) - (Z_j - Z_1)\beta = 0.$

Thus, even in samples of limited size the mean of the error terms $\{\eta_j\}$ is approximately equal
to zero. Define the dependent variable $Y_j$ by

         $Y_j = \log\Big(\dfrac{\hat H_j}{\hat H_1}\Big).$

We now realize that due to (6.4) we can estimate $\beta$ by regression analysis with $\{Y_j\}$ as dependent
variables and $\{Z_j - Z_1\}$ as independent variables.
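
A minimal sketch of the regression step in Berkson's method follows. It assumes a scalar attribute $Z_j$ and uses unweighted least squares through the origin, which is a simplification for illustration; the function name and the numbers are hypothetical.

```python
import math

def berkson_estimate(shares, Z):
    """Regress Y_j = log(H_j / H_1) on (Z_j - Z_1), j = 2,...,m, without an intercept."""
    y = [math.log(h / shares[0]) for h in shares[1:]]
    x = [z - Z[0] for z in Z[1:]]
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

# Hypothetical check: shares generated exactly by a logit model with beta = 0.7.
Z = [0.0, 1.0, 2.5]
true_beta = 0.7
weights = [math.exp(true_beta * z) for z in Z]
shares = [w / sum(weights) for w in weights]
beta_hat = berkson_estimate(shares, Z)
```

With observed shares in place of the exact ones, the estimate differs from the true value by sampling error that shrinks as N grows.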
6.3. Maximum likelihood estimation of the Tobit model
Notice first that due to the form of (5.2) ordinary regression analysis will not do because of the
nonlinear operation on the right hand side of (5.2).

From (5.2) it follows that

(6.6)    $P(Y = 0) = P(u \le -X\beta/\sigma) = F(-X\beta/\sigma)$

where F(y) denotes the cumulative distribution of u, and

(6.7)    $P(Y \in (y, y+dy)) = P(u\sigma \in (y - X\beta,\; y + dy - X\beta)) = \dfrac{1}{\sigma} F'\Big(\dfrac{y - X\beta}{\sigma}\Big) dy,$

for y > 0. Consider now the estimation of the unknown parameters based on observations from a
random sample of N individuals, and as above, let i=1,2,..., be an indexation of the individuals in the
sample. Let $S_1$ be the set of individuals for which $Y_i > 0$ and $S_0$ the remaining set of individuals for
whom $Y_i = 0$. We shall distinguish between two cases, namely the case where we observe $X_i$ and $Y_i$
for all the individuals (Case I), and the case where we do not observe $X_i$ when $i \in S_0$ (Case II).


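Case I: $X_i$ is observed for all $i \in S_0 \cup S_1$ (Censored case)

From (6.7) it follows that the density of $Y_i$ when $Y_i > 0$ equals

         $\dfrac{1}{\sigma} F'\Big(\dfrac{Y_i - X_i\beta}{\sigma}\Big)$

while, by (6.6), the probability that $i \in S_0$ equals

         $F\Big(\dfrac{-X_i\beta}{\sigma}\Big).$

Therefore the total loglikelihood equals

(6.8)    $\ell = \sum_{i \in S_1} \Big(\log F'\Big(\dfrac{Y_i - X_i\beta}{\sigma}\Big) - \log\sigma\Big) + \sum_{i \in S_0} \log F\Big(\dfrac{-X_i\beta}{\sigma}\Big).$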


Example 6.1 


Suppose F(y) is the standard normal distribution function, $\Phi(y)$. Then since

         $\Phi'(u) = \dfrac{1}{\sqrt{2\pi}} e^{-u^2/2}$

it follows that the loglikelihood in this case reduces, apart from an additive constant, to

(6.9)    $\ell = -\sum_{i \in S_1} \dfrac{(Y_i - X_i\beta)^2}{2\sigma^2} - N_1 \log\sigma + \sum_{i \in S_0} \log\Phi\Big(\dfrac{-X_i\beta}{\sigma}\Big)$

where $N_1$ denotes the number of individuals in $S_1$. We realize that applying OLS to the equation
$Y = X\beta + u\sigma$ corresponds to neglecting the last term in (6.9) and will therefore produce biased estimates.
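
The censored-sample loglikelihood with normal errors can be written out directly. The sketch below is illustrative only: the function names and the data layout (a list of regressor rows `X` and a list of outcomes `Y`) are assumptions, and unlike (6.9) it keeps the additive constant $-\frac{1}{2}\log 2\pi$ for each uncensored observation.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def tobit_loglik(beta, sigma, X, Y):
    """Censored (Case I) Tobit loglikelihood with standard normal u."""
    ll = 0.0
    for x_i, y_i in zip(X, Y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if y_i > 0:
            # uncensored observation: log density of Y_i
            z = (y_i - xb) / sigma
            ll += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        else:
            # censored observation: log P(Y_i = 0) = log Phi(-X_i beta / sigma)
            ll += math.log(norm_cdf(-xb / sigma))
    return ll
```

Such a function could be passed to any general-purpose numerical optimizer to obtain the maximum likelihood estimates of $\beta$ and $\sigma$; dropping the censored branch reproduces the OLS criterion and hence the bias noted above.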


Example 6.2 
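Suppose that F(y) is a standard logistic distribution, L(y), given by (2.9). Since
$1 - L(-y) = L(y)$ and

(6.10)   $L'(y) = L(y)(1 - L(y))$

the loglikelihood function in this case is

(6.11)   $\ell = \sum_{i \in S_1} \Big(\log L\Big(\dfrac{Y_i - X_i\beta}{\sigma}\Big) + \log\Big(1 - L\Big(\dfrac{Y_i - X_i\beta}{\sigma}\Big)\Big)\Big) - N_1 \log\sigma + \sum_{i \in S_0} \log L\Big(\dfrac{-X_i\beta}{\sigma}\Big)$

where $N_1$ denotes the number of individuals in $S_1$.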


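Case II: $X_i$ is not observed for $i \in S_0$ (Truncated case)

In this case we must evaluate the conditional likelihood function given that the individuals
belong to $S_1$. The conditional probability of $Y_i \in (y, y+dy)$, y > 0, given that $Y_i > 0$ equals

         $P(Y_i \in (y, y+dy) \mid Y_i > 0) = \dfrac{P(Y_i \in (y, y+dy))}{P(Y_i > 0)} = \dfrac{F'\big((y - X_i\beta)/\sigma\big)\, dy}{\sigma\big(1 - F(-X_i\beta/\sigma)\big)}.$

Therefore, the conditional loglikelihood given that $Y_i > 0$ for all i, equals

(6.12)   $\ell = \sum_{i \in S_1} \Big(\log F'\Big(\dfrac{Y_i - X_i\beta}{\sigma}\Big) - \log\Big(1 - F\Big(\dfrac{-X_i\beta}{\sigma}\Big)\Big)\Big) - N_1 \log\sigma,$

where $N_1$ denotes the number of individuals in $S_1$.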


6.4. Estimation of the Tobit model by Heckman's two stage method 
Heckman (1979) suggested a two stage method for estimating the tobit model. We shall briefly review 


his method for the case where F(y) is either the normal distribution or the logistic distribution. 
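6.4.1. Heckman's method with normally distributed random terms

As above $\Phi(\cdot)$ denotes the cumulative normal distribution function. From (5.2) we get

(6.13)   $E(Y \mid Y > 0) = X\beta + \sigma E(u \mid Y > 0).$

Since $E(u \mid Y > 0)$ in general is different from zero we cannot, as mentioned above, do linear
regression analysis based on the subsample of individuals in $S_1$. Now note that

(6.14)   $P(u \in (y, y+dy) \mid Y > 0) = P\Big(u \in (y, y+dy) \,\Big|\, u > -\dfrac{X\beta}{\sigma}\Big) = \dfrac{\Phi'(y)\, dy}{1 - \Phi(-X\beta/\sigma)} = \dfrac{\Phi'(y)\, dy}{\Phi(X\beta/\sigma)}$

for $y > -X\beta/\sigma$, since $-u$ has the same distribution as u due to symmetry. We therefore get

(6.15)   $E(u \mid Y > 0) = \dfrac{1}{\Phi(X\beta/\sigma)} \int_{-X\beta/\sigma}^{\infty} u\, \Phi'(u)\, du.$

But

(6.16)   $\int_{-X\beta/\sigma}^{\infty} u\, \Phi'(u)\, du = \int_{-X\beta/\sigma}^{\infty} \dfrac{u\, e^{-u^2/2}}{\sqrt{2\pi}}\, du = \Big[-\dfrac{e^{-u^2/2}}{\sqrt{2\pi}}\Big]_{-X\beta/\sigma}^{\infty} = \dfrac{1}{\sqrt{2\pi}} \exp\Big(-\Big(\dfrac{X\beta}{\sigma}\Big)^2 \Big/ 2\Big) = \Phi'\Big(\dfrac{X\beta}{\sigma}\Big)$

which together with (6.15) yields

(6.17)   $E(u \mid Y > 0) = \dfrac{\Phi'(X\beta/\sigma)}{\Phi(X\beta/\sigma)} \equiv \lambda,$

where the last notation ($\lambda$) is introduced for convenience.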
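
The correction term of (6.17) is straightforward to compute. The following sketch evaluates the ratio $\Phi'(z)/\Phi(z)$ at $z = X\beta/\sigma$ and compares it with a direct numerical integration of (6.15); the function names, the upper integration limit and the step count are assumptions made for the illustration.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def inverse_mills(z):
    """lambda = Phi'(z) / Phi(z), evaluated at z = X*beta/sigma, as in (6.17)."""
    return norm_pdf(z) / norm_cdf(z)

def truncated_mean_numeric(z, upper=10.0, steps=20000):
    """Trapezoid-rule approximation of E(u | u > -z) for standard normal u, as in (6.15)."""
    a = -z
    h = (upper - a) / steps
    total = 0.5 * (a * norm_pdf(a) + upper * norm_pdf(upper))
    for i in range(1, steps):
        u = a + i * h
        total += u * norm_pdf(u)
    return total * h / norm_cdf(z)
```

The ratio is positive for every z and decreases as z grows, so the selectivity correction matters most for individuals with a low probability of a positive outcome.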


Heckman suggested the following approach: First estimate $\beta/\sigma$ by probit analysis, i.e., by
maximizing the likelihood with the dependent variable equal to one if $i \in S_1$ and zero otherwise. The
corresponding loglikelihood equals

(6.18)   $\ell = \sum_{i \in S_1} \log\Phi\Big(\dfrac{X_i\beta}{\sigma}\Big) + \sum_{i \in S_0} \log\Big(1 - \Phi\Big(\dfrac{X_i\beta}{\sigma}\Big)\Big).$
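From the estimates $\hat\beta^*$ of $\beta/\sigma$, compute

         $\hat\lambda_i = \dfrac{\Phi'(X_i\hat\beta^*)}{\Phi(X_i\hat\beta^*)}$

and estimate $\beta$ and $\sigma$ by regression analysis on the basis of

(6.19)   $Y_i = X_i\beta + \sigma\hat\lambda_i + \eta_i$

by applying the observations from $S_1$. This gives (asymptotically) unbiased estimates because it follows from (6.13)
and (6.17) that

         $E(\eta_i \mid Y_i > 0) = E(Y_i - X_i\beta - \sigma\lambda_i \mid Y_i > 0)$

         $= E(\sigma u_i - \sigma\lambda_i \mid Y_i > 0) = \sigma E(u_i \mid Y_i > 0) - \sigma\lambda_i$

         $= \sigma\, \dfrac{\Phi'(X_i\beta/\sigma)}{\Phi(X_i\beta/\sigma)} - \sigma\lambda_i = 0.$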


Heckman (1979) has obtained the asymptotic covariance matrix of the parameter estimates that takes
into account that one of the regressors, $\lambda_i$, is represented by the estimate, $\hat\lambda_i$.

Note that this procedure leads to two separate estimates of $\sigma$, namely the one obtained as a
regression coefficient in (6.19) and the one that follows by dividing the mean component value of the
estimated $\beta$ by the corresponding mean based on $\hat\beta^*$.


6.4.2. Heckman's method with logistically distributed random term 
Assume now that u is distributed according to the logistic distribution L(y). Then by Lemma 2 in
Appendix A it is proved that

(6.20)   $E(u \mid Y > 0) = (1 + \exp(-X\beta/\sigma)) \log(1 + \exp(X\beta/\sigma)) - X\beta/\sigma.$

In this case the regression model that corresponds to (6.19) equals

(6.21)   $Y_i = X_i\beta + \sigma\hat\theta_i + \tilde\eta_i$

where

(6.22)   $\hat\theta_i = (1 + \exp(-X_i\hat\beta^*)) \log(1 + \exp(X_i\hat\beta^*)) - X_i\hat\beta^*$

and $\hat\beta^*$ is the first stage maximum likelihood estimate of $\beta/\sigma$ based on the binary logit model with
loglikelihood equal to (6.18) with $\Phi(y)$ replaced by L(y).
A modified version of Heckman's method

Since

         $P(Y > 0) = \dfrac{1}{1 + \exp(-X\beta/\sigma)}$

it follows from (6.20) that

(6.23)   $EY = P(Y > 0)\big(E(u \mid Y > 0)\sigma + X\beta\big)$

         $= \sigma \log(1 + \exp(X\beta/\sigma)) - \dfrac{X\beta}{1 + \exp(-X\beta/\sigma)} + \dfrac{X\beta}{1 + \exp(-X\beta/\sigma)} = \sigma \log(1 + \exp(X\beta/\sigma))$

         $= \sigma \log(1 + \exp(-X\beta/\sigma)) + X\beta = X\beta - \sigma \log P(Y > 0).$

Eq. (6.23) implies that we may alternatively apply regression analysis on the whole sample based on
the model

(6.24)   $Y_i = X_i\beta + \sigma\hat\mu_i + \delta_i$

where

(6.25)   $\hat\mu_i = \log(1 + \exp(-X_i\hat\beta^*))$

and $\delta_i$ is an error term with zero mean. This is so because (6.23) implies that

         $E\delta_i = E\big(Y_i - X_i\beta + \sigma \log P(Y_i > 0)\big) = 0.$


With the present state of computer software, where maximum likelihood procedures are readily 


available and easy to apply, Heckman’s two stage approach may be of less interest. 
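
The logistic correction terms in (6.20) and (6.23) can be verified numerically. The sketch below checks the identity $P(Y>0)(\sigma E(u \mid Y>0) + X\beta) = \sigma\log(1+\exp(X\beta/\sigma))$ at one point; the function names and the values of $X\beta$ and $\sigma$ are hypothetical.

```python
import math

def logistic_cond_mean(a):
    """E(u | u > -a) for standard logistic u, as in (6.20), with a = X*beta/sigma."""
    return (1.0 + math.exp(-a)) * math.log(1.0 + math.exp(a)) - a

def tobit_mean_logistic(xb, sigma):
    """E(Y) = sigma * log(1 + exp(X*beta/sigma)), as in (6.23)."""
    return sigma * math.log(1.0 + math.exp(xb / sigma))

# Hypothetical values of X*beta and sigma.
xb, sigma = 1.5, 2.0
p_pos = 1.0 / (1.0 + math.exp(-xb / sigma))               # P(Y > 0)
lhs = p_pos * (sigma * logistic_cond_mean(xb / sigma) + xb)
rhs = tobit_mean_logistic(xb, sigma)
```

In a two stage application, `logistic_cond_mean` evaluated at the first stage index would supply the regressor of (6.22), while the logarithm of the inverse of `p_pos` supplies the regressor of (6.25).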


6.5. The likelihood ratio test
The likelihood ratio test is a very general method which can be applied in a wide variety of cases. A
typical null hypothesis (H) is that there are specific constraints on the parameter values. For example,
several parameters may be equal to zero, or two or more parameters may be equal to each other. Let
$\hat\beta^H$ denote the constrained maximum likelihood estimate obtained when the likelihood is maximized
subject to the restrictions on the parameters under H. Similarly, let $\hat\beta$ denote the parameter estimate
obtained from unconstrained maximization of the likelihood. Let $\ell(\hat\beta^H)$ and $\ell(\hat\beta)$ denote the
loglikelihood values evaluated at $\hat\beta^H$ and $\hat\beta$, respectively. Let r be the number of independent
restrictions implied by the null hypothesis. It can be demonstrated that

         $-2\big(\ell(\hat\beta^H) - \ell(\hat\beta)\big)$

is asymptotically chi squared distributed with r degrees of freedom under H. Thus, if $-2(\ell(\hat\beta^H) - \ell(\hat\beta))$ is
"large" (i.e. exceeds the critical value of the chi squared with r degrees of freedom), then the null
hypothesis is rejected.
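
As a small illustration of the test, the statistic and its asymptotic p-value can be computed as follows. The sketch uses only the standard library and evaluates the chi squared tail probability for integer degrees of freedom by a recurrence on the incomplete gamma function; the function names and the loglikelihood values in the example are hypothetical.

```python
import math

def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio statistic -2 * (l(restricted) - l(unrestricted))."""
    return -2.0 * (loglik_restricted - loglik_unrestricted)

def chi2_sf(x, df):
    """P(chi squared with df degrees of freedom > x), for integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, nu = math.erfc(math.sqrt(half)), 1   # df = 1
    else:
        q, nu = math.exp(-half), 2              # df = 2
    while nu < df:
        # step from nu to nu + 2 degrees of freedom
        q += half ** (nu / 2.0) * math.exp(-half) / math.gamma(nu / 2.0 + 1.0)
        nu += 2
    return q

# Hypothetical loglikelihood values with r = 2 restrictions.
stat = lr_statistic(-105.0, -100.0)
p_value = chi2_sf(stat, 2)
```

The null hypothesis would be rejected at the 5 per cent level whenever the computed p-value falls below 0.05.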

In the literature, other types of tests, particularly designed for testing the "Independence from
Irrelevant Alternatives" hypothesis, have been developed. I refer to Ben-Akiva and Lerman (1985), p.
183, for a review of these tests.


6.6. McFadden's goodness-of-fit measure 


As a goodness-of-fit measure McFadden has proposed a measure given by

(6.26)   $\rho^2 = 1 - \dfrac{\ell(\hat\beta)}{\ell(0)}$

where, as before, $\ell(\hat\beta)$ is the unrestricted loglikelihood evaluated at $\hat\beta$ and $\ell(0)$ is the loglikelihood
evaluated by setting all parameters equal to zero. A motivation for (6.26) is as follows: If the
estimated parameters do no better than the model with zero parameters then $\ell(\hat\beta) = \ell(0)$ and thus
$\rho^2 = 0$. This is the lowest value that $\rho^2$ can take (since if $\ell(\hat\beta)$ is less than $\ell(0)$, then $\hat\beta$ would not be
the maximum likelihood estimate). Suppose instead that the model was so good that each outcome in
the sample could be predicted perfectly. Then the corresponding likelihood would be one, which
means that the loglikelihood $\ell(\hat\beta)$ is equal to zero. Thus in this case $\rho^2 = 1$, which is the highest value
$\rho^2$ can take. This goodness-of-fit measure is similar to the familiar $R^2$ measure used in regression
analysis in that it ranges between zero and one. However, there are no general guidelines for when a
$\rho^2$ value is sufficiently high, cf. sections 4.8 and 4.10.
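
A short sketch of the computation in (6.26) follows. The helper for $\ell(0)$ assumes, for illustration only, a logit model in which every agent faces the same number of alternatives, so that setting all parameters to zero gives each alternative probability 1/m; both function names are hypothetical.

```python
import math

def mcfadden_rho2(loglik_fitted, loglik_zero):
    """McFadden's goodness-of-fit measure: 1 - l(beta_hat) / l(0)."""
    return 1.0 - loglik_fitted / loglik_zero

def zero_loglik(n_obs, n_alternatives):
    """l(0) when all parameters are zero and each of m alternatives has probability 1/m."""
    return n_obs * math.log(1.0 / n_alternatives)
```

For example, a fitted loglikelihood of -80 against a zero-parameter loglikelihood of -100 gives a value of about 0.2.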
7. Advanced examples of discrete/continuous choice analysis


7.1. Behavior of the firm when technology is a discrete choice variable 


Suppose the firm must choose one out of m possible technologies. Let

(7.1)    $\pi_j = f(p_j, q_j) \exp(\varepsilon_j/\alpha),$

j=1,2,...,m, be the firm's profit conditional on technology j, where $p_j$ is the output price, $q_j$ is a
vector of input prices, and $\varepsilon_j$ is a random term that accounts for unobservable variables that affect
production with technology j. We assume that $\{\varepsilon_j\}$ are i.i.d. standard extreme value distributed and
$\alpha > 0$ is a constant. We realize that when $\alpha$ decreases then the effect of unobservable heterogeneity
will increase.


By Hotelling's Lemma we obtain that output, $Y_j$, conditional on technology j, is given by

(7.2)    $Y_j = \dfrac{\partial f(p_j, q_j)}{\partial p_j} \exp(\varepsilon_j/\alpha)$

and similarly input of type r, conditional on technology j, is equal to

(7.3)    $X_{rj} = -\dfrac{\partial f(p_j, q_j)}{\partial q_{rj}} \exp(\varepsilon_j/\alpha).$

Let

(7.4)    $V_j = \alpha \log f(p_j, q_j) + \varepsilon_j.$

It follows from (7.1) and (7.4) that the probability that the firm shall choose technology j equals

(7.5)    $P_j = P(\pi_j = \max_k \pi_k) = P(V_j = \max_k V_k) = \dfrac{\exp(\alpha \log f(p_j, q_j))}{\sum_k \exp(\alpha \log f(p_k, q_k))}.$

Recall that by Lemma A2 in Appendix A

(7.6)    $P(\max_k V_k \le y \mid V_j = \max_k V_k) = P(\max_k V_k \le y).$

Therefore we obtain that

(7.7)    $E\Big(\exp\Big(\dfrac{1}{\alpha} V_j\Big) \,\Big|\, V_j = \max_k V_k\Big) = E \exp\Big(\dfrac{1}{\alpha} \max_k V_k\Big).$
Moreover,

(7.8)    $P(\max_k V_k \le y) = \prod_k P(V_k \le y) = \exp(-e^{-y} A)$

where

(7.9)    $A = \sum_k \exp(\alpha \log f(p_k, q_k)).$

Hence

(7.10)   $E \exp\Big(\dfrac{1}{\alpha} \max_k V_k\Big) = \int_{-\infty}^{\infty} e^{y/\alpha} \exp(-e^{-y} A)\, A e^{-y}\, dy$

which by change of variable, $A e^{-y} = x$, reduces to

(7.11)   $E \exp\Big(\dfrac{1}{\alpha} \max_k V_k\Big) = A^{1/\alpha} \int_0^{\infty} x^{-1/\alpha} e^{-x}\, dx = A^{1/\alpha}\, \Gamma\Big(1 - \dfrac{1}{\alpha}\Big)$

provided $\alpha > 1$. When $\alpha \le 1$ this mean is infinite. From (7.2), (7.7) and (7.11) we get


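(7.12)   $E(Y_j \mid \pi_j = \max_k \pi_k) = \dfrac{\partial f(p_j, q_j)}{\partial p_j}\, E\big(\exp(\varepsilon_j/\alpha) \mid V_j = \max_k V_k\big)$

         $= \dfrac{\partial \log f(p_j, q_j)}{\partial p_j}\, E \exp\Big(\dfrac{1}{\alpha} \max_k V_k\Big)$

         $= \dfrac{\partial \log f(p_j, q_j)}{\partial p_j} \Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big]^{1/\alpha} \Gamma\Big(1 - \dfrac{1}{\alpha}\Big).$

Similarly, it follows that

(7.13)   $E(X_{rj} \mid \pi_j = \max_k \pi_k) = -\dfrac{\partial \log f(p_j, q_j)}{\partial q_{rj}} \Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big]^{1/\alpha} \Gamma\Big(1 - \dfrac{1}{\alpha}\Big),$

(7.14)   $E(\pi_j \mid \pi_j = \max_k \pi_k) = E(\max_k \pi_k) = \Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big]^{1/\alpha} \Gamma\Big(1 - \dfrac{1}{\alpha}\Big)$

and

(7.15)   $E(\log \pi_j \mid \pi_j = \max_k \pi_k) = E(\max_k \log \pi_k) = \dfrac{1}{\alpha} \log\Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big] + \dfrac{0.5772}{\alpha}.$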


From the results above we can deduce an interesting aggregation property. We get from (7.14) that 
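(7.16)   $\dfrac{\partial E(\max_k \pi_k)}{\partial p_j} = \Gamma\Big(1 - \dfrac{1}{\alpha}\Big) \Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big]^{1/\alpha - 1} \exp(\alpha \log f(p_j, q_j))\, \dfrac{\partial \log f(p_j, q_j)}{\partial p_j}$

         $= \Gamma\Big(1 - \dfrac{1}{\alpha}\Big) \Big[\sum_k \exp(\alpha \log f(p_k, q_k))\Big]^{1/\alpha} P_j\, \dfrac{\partial \log f(p_j, q_j)}{\partial p_j}.$

But by comparing (7.12) and (7.16) we realize that

(7.17)   $\dfrac{\partial E(\max_k \pi_k)}{\partial p_j} = P_j\, E(Y_j \mid \pi_j = \max_k \pi_k) = EY_j.$

Similarly, it follows readily that

(7.18)   $-\dfrac{\partial E(\max_k \pi_k)}{\partial q_{rj}} = P_j\, E(X_{rj} \mid \pi_j = \max_k \pi_k) = EX_{rj}.$

Finally, it can easily be demonstrated that

(7.19)   $P_j = \dfrac{\partial \log E(\max_k \pi_k)}{\partial \log f(p_j, q_j)}.$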

The results above demonstrate that assumptions (7.1) and (7.2) imply that it is possible to
define a representative agent with profit function $E(\max_k \pi_k)$, from which one can derive fractional
technology choice rates, $P_j$, and aggregate demands. These are equivalent to the choice probabilities
and aggregate demands and production derived from profit-maximizing micro agents.
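
The aggregation property can be illustrated numerically. In the sketch below the values $f(p_j, q_j)$ are taken as given numbers, the expected maximum profit is computed from the closed form in (7.14), and its log-derivative with respect to $\log f(p_j, q_j)$ is compared with the choice probability of (7.5). All names and numbers are assumptions made for the illustration.

```python
import math

def tech_choice_probs(f, alpha):
    """Technology choice probabilities (7.5): P_j proportional to f_j ** alpha."""
    w = [f_j ** alpha for f_j in f]
    total = sum(w)
    return [x / total for x in w]

def expected_max_profit(f, alpha):
    """E(max_k pi_k) = A**(1/alpha) * Gamma(1 - 1/alpha), finite only when alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("the mean is infinite unless alpha > 1")
    A = sum(f_j ** alpha for f_j in f)
    return A ** (1.0 / alpha) * math.gamma(1.0 - 1.0 / alpha)

def share_from_aggregate(f, alpha, j, h=1e-6):
    """Central-difference derivative of log E(max profit) with respect to log f_j."""
    up, down = list(f), list(f)
    up[j] = f[j] * math.exp(h)
    down[j] = f[j] * math.exp(-h)
    diff = math.log(expected_max_profit(up, alpha)) - math.log(expected_max_profit(down, alpha))
    return diff / (2.0 * h)

# Hypothetical values of f(p_j, q_j) for three technologies.
f_values = [2.0, 1.0, 3.0]
alpha = 2.5
probs = tech_choice_probs(f_values, alpha)
```

The numerical derivative of the representative agent's log profit reproduces the micro-level choice probabilities, which is the content of (7.19).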


7.2. Labor supply with taxes (I) 

This example is an extension of the example in section 4.1. Consider the choice of "working" versus
"not working", and annual hours of work when working. We assume that there is no rationing in the
market, so that if the agent wishes to work he will be able to get work. Let the agent's utility function
in consumption, C, and (normalized) leisure, L = 1 - h/M, be given by


h \” a 
… (cm = 1B, (3 -ijem 
(7.20) V(C,L) = "TR A 


Q, OQ, 


where M = 8760, is total number of hours a year, h is hours of work and a, <1,a, <1, 


B, >0,B, >0. The budget constraint is given by 


(7.21) C = hW + I — S(hW, I) 
where W is the wage rate, I is nonlabor income and S(·) is the tax function. There is no fixed cost of
working.


The marginal rate of substitution equals 


h Ven 
a,V(C,L) (ER Pr 


(7.22) = = 
d, V(C,L) B,c™ 

Let 

(7.23) g(x, yJ=x+y—S(x, y). 


Then it follows that the agent wishes to work if 


d2 V(g0, 1,1) _ B, g0, D 


7.24 W 9,g(0, I) > = , 
sd a d, V(g(0,1),1) B, 


and hours of work, h , is determined from 


- d,V(g(hw,1),1—h/M Be a. ees: 
(7.25) va FEEL Mag, ey g(ñw,1) h, 


provided (7.24) holds. The left hand side of (7.24) is called the marginal wage rate at zero hours of 


work, and the right hand side of (7.24) is called the reservation wage. Assume that B, /B, and W are 


specified as in (4.7) and (4.8). 


Estimation by Heckman's two stage method 


From (7.25) we have that hours of work is determined by 


mp 


: i ; B 
(7.26) (a,—1)log i ae = log W + log d,g(hW,1) + (a, —1)logg(hW,1)— log E) 


l 


provided (7.24) holds. Therefore, we face the usual "Tobit problem" that the random term, €, —€,, 
does not have zero expectation and consequently we cannot apply standard regression analysis. Both 


h and W are endogenous variables. h is endogenous because it is the hours of work function. 


Although W is exogenous theoretically it may be endogenous statistically due to unobservables that 


affect preferences through the hours of work function. If log (B, /B, ) are replaced by (4.7) and we 


divide both sides of (7.26) by &, —1 we obtain 
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me 


(7.27) og ae = max (0,— X,br, +1, ElogW +r, logd,g(hW,1)+r, logg(hW,1)+r,(€; —£, )) 


where r =1/(1- a, ) and r, = (a, —1)/(1-@,), and where ElogW is given by (4.8). Now the 
labor supply eq. (7.27) is well defined for both working and non-working individuals. However, it is 


nonlinear in parameters, and there still remains the endogenous variable hW on the right hand side. 
On the subsample of those who work it is, however, linear, but we cannot apply standard regression 
analysis because, in addition to the endogeneity problem, the conditional expectation of the error 
terms given the subsample of workers is not equal to zero. To account for these problems we shall 


apply Heckman’s two stage method. Let 


where 


t =1, Var(e, -€,). 


By applying the result obtained in section 6.4.1, it follows that 


o( = +r, logd,2(0,1) +r, log g(0, 2 
T 


7.29 A= 
(7.29) P, 


where P; is the probability of working, and can be written as 


(7.30) o p, Jo nitong di a esii 


T 


and where Xs =X a- X,b. Hence, it follows that 


(7.31) ef -ios(: = =) h> o| = Xsr +1, log dg (Wh,I) + T, log g(Wh,1) + TÅ 
which means that we can write 

h sg ig 
(7.32) -1og(: -È = Xsr, +r) logd,g(Wh,1) + r, log g(Wh,1)+ tà +n, 


where n2 is a random term with the property that 
         $E(\eta_2 \mid h > 0) = 0.$
Similarly, it follows that 


(7.33) E (log W| i> 0)=X,a+ pr 
where 


p=corr(e,,€, — £3 ). 


The relation (7.33) is useful because it enables us to estimate the wage equation from a sample of 
working individuals, as we shall see in a moment. The term pTA in (7.33) may be called the 
"selectivity bias". It is different from zero when p #0 due to the fact that in this case there is 
correlation between the random term in the wage equation and the sample selection criteria (namely, 


h>0). Due to (7.33) we can write 


(7.34) log W=X,a+pth+n, 


where 
E(n,|h>0)=0. 


If $\lambda$ were known it would be possible to estimate (7.32) and (7.34) as a simultaneous equation system.
Unfortunately, $\lambda$ is unknown and this is therefore not possible. We can, however, apply the estimates
from the probability of working to obtain an estimate of $\lambda$.


Step 1
Estimate the parameters of the probit model (7.30) on the basis of discrete observations on
whether the agents are working or not working.

Step 2
Estimate the wage equation (7.34) by using $\hat\lambda$ as a regressor, where $\hat\lambda$ is an estimate of $\lambda$
obtained from step one.

Step 3
Replace $\log \partial_1 g(Wh, I)$ and $\log g(Wh, I)$ by instrument relations

(7.35)   $\log \partial_1 g(Wh, I) = Z\theta_1 + u_1$

and

(7.36)   $\log g(Wh, I) = Z\theta_2 + u_2$

where Z is a set of instrument variables, Z = (X, I), and $u_1$ and $u_2$ are zero mean random terms.
Estimate (7.35) and (7.36).

Step 4
Insert $\hat\lambda$ and the estimated wage equation (without the selectivity term) and the estimated
instrument relations (7.35) and (7.36) into (7.32), from which the structural parameters can be
estimated.


Estimation by maximum likelihood 


Since £; and £, are normally distributed we can write 


(7.37) E€, — E€ =9E, +E, 


where &; is a zero mean normal variable that is independent of £, and 0 is some constant. Let S2 be the 
subsample of individuals that work and S, the subsample of individuals that do not work. Let i index 


individual i. From (7.26), (4.7), (4.8) and (7.37) we have that when h; >0 


h; ~ 
7.38) €,,=—0e,, + (1-0, oe. +X,at log d,g(h; W, L) 


+ (æ; —1)log g(h; W, ,1;)- X,;b. 
Note that we can express £4; as 


(7.39) €; =log W, - X,,a. 


Let l, be the (conditional) loglikelihood for the subsample of individuals that work. From (7.38) we 


have 


dE; A, —1 rad d(WihL) (€; —1)W, 3,g(W,h,.1, ) 
dh, M-h, 3,8(W,h, I; | g(W,h, 1, | ) 


1 


(7.40) 


The loglikelihood for the subsample of those who work becomes 
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log a, g(h; W, I. }+ (ot, - 1)log g(h, W, ,1,)-@log W, +Xya(0e1)-26,04(1~04 tog 1E 





eS, 0; 0, 


where ®’(-) is the standard normal density, of =Vare,, and of = Vare,.. 


The likelihood for non-working individuals equals 


l g(0, I, —1)l NE d 
(7.42) exp £, =I I of Se eS 


Oo 


ieS, 


where 6” = Var (£, =£] ). The total loglikelihood, Z , is therefore equal to 


C=, +b. 


Results from empirical analysis of a sample of married women in Norway, 1979/1980 
Dagsvik et al. (1986) analyze female labor supply in Norway based on a sample of married women 
from the level of living survey/tax return files, 1979/1980, by applying the model discussed above. 
The variables that affect the women’s preferences are specified to be "Age", "Age squared", "Number 
of children below six years of age", "Number of children above six years", a disability dummy and an 
index of job opportunities for women. 

The variables that affect the wage equation are assumed to be "Age", "Age squared" and
"Years of education". 


The estimates obtained by the four step procedure are displayed in Tables 7.1 and 7.2 below. 





Table 7.1. Estimates of the parameters in the utility function


Independent variables                       Estimate    Standard deviation
Intercept                                   -5.35       0.80
Age                                          0.158      0.03
10^-2 x age squared                         -0.205      0.03
Number of children less than six years      -0.289      0.07
Number of children above six years          -0.079      0.04
Disability index                            -0.398      0.09
Index of job opportunities                   0.727      0.59
$\alpha_1$ (Consumption)                     1.0
$\alpha_2$ (Leisure)                        -4.28       0.11
Marginal wage (1/0)                          0.965      0.13


Table 7.2. Estimates of the wage equation


Independent variables        Estimate    Standard deviation
Intercept                     2.161      0.28
Years of education            0.065      0.01
Age                           0.030      0.01
10^-2 x age squared          -0.032      0.01
Selectivity, $\lambda$       -0.105      0.06

$R^2$                         0.16


7.3. Labor supply with taxes (II) 
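We will now consider the case where $\varepsilon_1$ and $\varepsilon_2$ are jointly extreme value distributed. Dagsvik et al.
(1988) have analyzed female labor supply in France based on the model formulation above, but where
$(\varepsilon_1, \varepsilon_2)$ are bivariate extreme value distributed instead of bivariate normal. Thus,


(7.43)  $P(\varepsilon_1 \le y_1, \varepsilon_2 \le y_2) = \exp\left(-\left(e^{-y_1/\rho\sigma} + e^{-y_2/\rho\sigma}\right)^{\rho}\right)$,


where $\rho$, $0 < \rho \le 1$, is related to the correlation coefficient by


(7.44)  $\mathrm{corr}(\varepsilon_1, \varepsilon_2) = 1 - \rho^2$

and

(7.45)  $\frac{\pi^2\sigma^2}{6} = \mathrm{Var}\,\varepsilon_1 = \mathrm{Var}\,\varepsilon_2.$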
Moreover, it follows that


(7.46)  $\tau^2 = \mathrm{Var}(\varepsilon_1 - \varepsilon_2) = \frac{\pi^2\sigma^2\rho^2}{3}.$
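
The step from (7.44) and (7.45) to (7.46) can be written out as follows. This is a short sketch, not part of the original derivation; it assumes that (7.45) is read as $\mathrm{Var}\,\varepsilon_1 = \mathrm{Var}\,\varepsilon_2 = \pi^2\sigma^2/6$, the variance of a type III extreme value variable with scale $\sigma$ (see Appendix A).

```latex
\begin{aligned}
\operatorname{Var}(\varepsilon_1 - \varepsilon_2)
  &= \operatorname{Var}\varepsilon_1 + \operatorname{Var}\varepsilon_2
     - 2\,\operatorname{corr}(\varepsilon_1,\varepsilon_2)
       \sqrt{\operatorname{Var}\varepsilon_1\,\operatorname{Var}\varepsilon_2} \\
  &= 2\cdot\frac{\pi^2\sigma^2}{6}\,\bigl(1 - (1-\rho^2)\bigr)
   = \frac{\pi^2\sigma^2\rho^2}{3}.
\end{aligned}
```

The result coincides with the variance of a logistic variable with scale parameter $\rho\sigma$, which is $(\rho\sigma)^2\pi^2/3$.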


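Since $\varepsilon_1$ and $\varepsilon_2$ are jointly extreme value distributed we get by Theorem 7 that


(7.47)  $P(\varepsilon_2 \le \varepsilon_1 + y) = P(\varepsilon_2 - \varepsilon_1 \le y) = \frac{\exp(y/\rho\sigma)}{1 + \exp(y/\rho\sigma)} = \frac{1}{1 + \exp(-y/\sigma\rho)}$,


which means that $\varepsilon_2 - \varepsilon_1$ has a logistic distribution. From (7.47) and (7.27) we get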


1 


7.48 P (b> 0)=—— 
(7.48) (h>o) 1+exp(—(X sr, +r, logd,g(0,1) +r, log g(0,1))/ po) 


From Lemma A2 in Appendix A we get 


log(1-P(h >0)) 


pi> 5 -+(x sr, +r, logð, g(Wh,1)+ r, log g(Wh,1)). 


a. x 
(7.49) h=—E(e, -£,|h>0)=- 
From (7.32), (7.48) and (7.49) we thus obtain 
h = ~ ~ o 
(7.50) — log É -E = Xsr, +1, log d,g(Wh,I) +r, log g(Wh,1) +TA +N, 


where $\eta_1$ is a random term such that $E(\eta_1 \mid h > 0) = 0$. Similarly, it can be proved that


(7.51)  $\log W = X_1 a - \rho\sigma \log P(h > 0) + \eta_2$,


where $\eta_2$ is a random term such that $E(\eta_2 \mid h > 0) = 0$.


It is now clear that the model specified above can be estimated in the same way as the model
specification in Section 7.2.
Appendix A


Some properties of the extreme value and the logistic distributions


In this appendix we collect some classical results about the logistic and the extreme value
distributions.

Let $X_1, X_2, \ldots$ be independent random variables with a common distribution function F(x).
Let
(A.1)  $M_n = \max(X_1, X_2, \ldots, X_n).$

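Theorem A1


Suppose that, for some $\alpha > 0$,
(A.2)  $\lim_{x \to \infty} x^{\alpha}(1 - F(x)) = c$,


where c>0. Then


(A.3)  $\lim_{n \to \infty} P\left(\frac{M_n}{(cn)^{1/\alpha}} \le x\right) = \begin{cases} \exp(-x^{-\alpha}) & \text{for } x > 0, \\ 0 & \text{for } x \le 0. \end{cases}$
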
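Theorem A2
Suppose that $F(x_0) = 1$, and that for some $\alpha > 0$,
(A.4)  $\lim_{x \uparrow x_0} (x_0 - x)^{-\alpha}(1 - F(x)) = c$,
where c>0. Then
(A.5)  $\lim_{n \to \infty} P\left((cn)^{1/\alpha}(M_n - x_0) \le x\right) = \begin{cases} \exp(-|x|^{\alpha}) & \text{for } x < 0, \\ 1 & \text{for } x \ge 0. \end{cases}$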


Theorem A3


Suppose that
(A.6)  $\lim_{x \to \infty} e^{x}(1 - F(x)) = c$,
where c>0. Then
(A.7)  $\lim_{n \to \infty} P(M_n - \log(cn) \le x) = \exp(-e^{-x})$


for all x.


Proofs of Theorems A1 to A3 are found in Lamperti (1996), for example. Moreover, it can be
proved that the distributions (A.3), (A.5) and (A.7) are the only ones possible. 

The three classes of limiting distributions for maxima were discovered during the 1920s by
M. Fréchet, R.A. Fisher and L.H.C. Tippett. In 1943 B. Gnedenko gave a systematic exposition of
limiting distributions of the maximum of a random sample.

Note that there is some similarity between the Central Limit Theorem and the results above in
that the limiting distributions are, apart from rather general conditions, independent of the original 
distribution. While the Central Limit Theorem yields only one limiting distribution, the limiting 
distributions of maxima are of three types, depending on the tail behavior of the distribution. The 
three types of distributions (A.3), (A.5) and (A.7) are called standard type I, II and III extreme value 
distributions. 
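
Theorem A3 can be illustrated by simulation. The sketch below is not part of the original text. It assumes unit exponential draws, F(x) = 1 - exp(-x), for which the tail condition of Theorem A3 holds with c = 1, and compares the empirical distribution of M_n - log n with the limit exp(-exp(-x)). The function names, sample sizes and seed are illustrative choices.

```python
import math
import random


def gumbel_cdf(x):
    """Standard type III extreme value distribution function exp(-e^{-x})."""
    return math.exp(-math.exp(-x))


def simulate_centered_maxima(n, replications, seed=0):
    """Return draws of M_n - log(n) where M_n is the max of n unit exponentials."""
    rng = random.Random(seed)
    shift = math.log(n)  # log(c n) with c = 1 for the unit exponential
    return [max(rng.expovariate(1.0) for _ in range(n)) - shift
            for _ in range(replications)]


def empirical_cdf(sample, x):
    """Share of the sample that is less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)


sample = simulate_centered_maxima(n=200, replications=4000)
for x in (-1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  empirical={empirical_cdf(sample, x):.3f}  "
          f"limit={gumbel_cdf(x):.3f}")
```

With n = 200 the empirical and limiting values should already be close; the remaining gap reflects both sampling noise and the finite n.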

The extreme value distributions have the following property: if $X_1$ and $X_2$ are independent type III
extreme value distributed with different location parameters, i.e.,


$P(X_j \le x_j) = \exp(-e^{b_j - x_j})$, j = 1, 2,


where $b_1$ and $b_2$ are constants, then $X = \max(X_1, X_2)$ is also type III extreme value distributed. This
is seen as follows: We have


$P(X \le x) = P((X_1 \le x) \cap (X_2 \le x)) = P(X_1 \le x)P(X_2 \le x) = \exp(-e^{b_1 - x})\exp(-e^{b_2 - x})$

$= \exp(-e^{-x}(e^{b_1} + e^{b_2})) = \exp(-e^{b - x})$,


where
$b = \log(e^{b_1} + e^{b_2}).$


Similar results hold for the other two types of extreme value distributions.
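
The closure property for type III variables can be checked numerically. The following is a minimal sketch, not taken from the original text: the location parameters, the seed and the helper names are hypothetical, and draws are generated by inverting the distribution function exp(-exp(b - x)).

```python
import math
import random


def draw_type3(b, rng):
    """Inverse-transform draw from P(X <= x) = exp(-e^{b - x})."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return b - math.log(-math.log(u))


def type3_cdf(x, b):
    """Type III extreme value distribution function with location b."""
    return math.exp(-math.exp(b - x))


def empirical_cdf(sample, x):
    """Share of the sample that is less than or equal to x."""
    return sum(1 for value in sample if value <= x) / len(sample)


rng = random.Random(1)
b1, b2 = 0.3, 1.2  # hypothetical location parameters
b = math.log(math.exp(b1) + math.exp(b2))  # location of the maximum
maxima = [max(draw_type3(b1, rng), draw_type3(b2, rng)) for _ in range(20000)]

for x in (1.0, 2.0, 3.0):
    print(f"x={x:.1f}  empirical={empirical_cdf(maxima, x):.3f}  "
          f"theory={type3_cdf(x, b):.3f}")
```

The empirical distribution of the maximum should match the type III distribution with location b = log(exp(b1) + exp(b2)) up to sampling noise.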

In the multivariate case where the random variables are vectors, there exist similar
asymptotic results for maxima as in the univariate case, where the maximum of a vector is defined as
the maximum taken componentwise. The resulting limiting distributions are called multivariate extreme
value distributions, and they are of three types as in the univariate case. A characterization of type III
is given in Theorem 7 in Section 3.10. More details about the multivariate extreme value distributions
can be found in Resnick (1987).


A general type III extreme value distribution has the form


$\exp(-e^{-(x-b)/a})$


and it has mean $b + 0.5772\ldots a$ and variance equal to $a^2\pi^2/6$.


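Lemma A1


Let $\varepsilon$ be standard type III extreme value distributed and let s<1. Then


$E e^{s\varepsilon} = \Gamma(1 - s)$


where $\Gamma(\cdot)$ denotes the Gamma function.


Proof:


We have


$E e^{s\varepsilon} = \int_{-\infty}^{\infty} e^{sx}\exp(-e^{-x})e^{-x}\,dx.$


By the change of variable $t = e^{-x}$ this expression reduces to


$E e^{s\varepsilon} = \int_{0}^{\infty} t^{-s}e^{-t}\,dt = \Gamma(1 - s).$


Q.E.D.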
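

Lemma A1 can be verified numerically for particular values of s. The sketch below is an illustration only: it evaluates the integral in the proof with a midpoint rule on a truncated range and compares it with the Gamma function. The integration limits and step count are assumptions chosen so that the truncation error is negligible for s no larger than about 0.5; values of s closer to one would need a wider range.

```python
import math


def mgf_type3(s, lower=-10.0, upper=80.0, steps=90000):
    """Midpoint-rule value of the integral of e^{sx} exp(-e^{-x}) e^{-x} dx."""
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * width
        # Integrand written as a single exponential for numerical stability.
        total += math.exp(s * x - x - math.exp(-x))
    return total * width


for s in (-1.0, 0.0, 0.5):
    print(f"s={s:+.1f}  integral={mgf_type3(s):.6f}  "
          f"Gamma(1-s)={math.gamma(1.0 - s):.6f}")
```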


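Lemma A2


Suppose $U_j = v_j + \varepsilon_j$, where $(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m)$ is multivariate extreme value distributed. Then


$P(\max_k U_k \le y \mid U_j = \max_k U_k) = P(U_j \le y \mid U_j = \max_k U_k) = P(\max_k U_k \le y).$


Proof: According to the definition of the multivariate extreme value distribution


(A.8)  $P(U_1 \le y_1, U_2 \le y_2, \ldots, U_m \le y_m) = F(y_1, y_2, \ldots, y_m) = \exp(-G(e^{v_1 - y_1}, e^{v_2 - y_2}, \ldots, e^{v_m - y_m}))$

where G(·) is homogeneous of degree one. For notational simplicity let j=1, since the general case is
completely analogous. We have


(A.9)  $P(\max_k U_k \in (z, z + dz), U_1 = \max_k U_k) = P(U_1 \in (z, z + dz), U_2 \le z, \ldots, U_m \le z) = \partial_1 F(z, z, \ldots, z)\,dz.$


Since
(A.10)  $G(e^{v_1 - z}, e^{v_2 - z}, \ldots, e^{v_m - z}) = e^{-z}G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})$
and $\partial_1 G$ is homogeneous of degree zero, we get
(A.11)  $\partial_1 F(z, z, \ldots, z) = \exp(-e^{-z}G(e^{v_1}, e^{v_2}, \ldots, e^{v_m}))\,\partial_1 G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})\,e^{v_1 - z}.$
Hence

(A.12)  $P(\max_k U_k \le y, U_1 = \max_k U_k) = \int_{-\infty}^{y} \partial_1 F(z, z, \ldots, z)\,dz$

$= e^{v_1}\partial_1 G(e^{v_1}, e^{v_2}, \ldots, e^{v_m}) \int_{-\infty}^{y} \exp(-e^{-z}G(e^{v_1}, e^{v_2}, \ldots, e^{v_m}))\,e^{-z}\,dz$

$= \frac{e^{v_1}\partial_1 G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})}{G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})} \cdot \exp(-e^{-y}G(e^{v_1}, e^{v_2}, \ldots, e^{v_m})).$


But the last factor in (A.12) equals $P(\max_k U_k \le y)$, as is easily seen from (A.8) and (A.10).
Moreover, by Theorem 7 the first factor on the right hand side of (A.12) equals $P(U_1 = \max_k U_k)$.


Thus the events $\{U_1 = \max_k U_k\}$ and $\{\max_k U_k \le y\}$ are stochastically independent.


Q.E.D.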
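

The independence stated in Lemma A2 can be illustrated by simulation. The sketch below is not from the original text and covers only the special case of independent components, which corresponds to G(x_1, ..., x_m) = x_1 + ... + x_m. The values of v_k, the threshold y, the seed and the function names are hypothetical.

```python
import math
import random


def draw_standard_type3(rng):
    """Inverse-transform draw from the standard type III distribution exp(-e^{-x})."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate(v, y, replications, seed=2):
    """Return (P(max <= y | U_1 is the max), P(max <= y)) estimated by simulation."""
    rng = random.Random(seed)
    n_first = n_first_below = n_below = 0
    for _ in range(replications):
        utilities = [vk + draw_standard_type3(rng) for vk in v]
        top = max(utilities)
        below = top <= y
        n_below += below
        if utilities[0] == top:  # alternative 1 attains the maximum
            n_first += 1
            n_first_below += below
    return n_first_below / n_first, n_below / replications


v = [0.0, 0.5, 1.0]  # hypothetical deterministic terms v_1, v_2, v_3
y = 2.0
conditional, unconditional = simulate(v, y, replications=60000)
# With G equal to the sum of its arguments, P(max <= y) = exp(-e^{-y} * sum_k e^{v_k}).
exact = math.exp(-math.exp(-y) * sum(math.exp(vk) for vk in v))
print(f"conditional={conditional:.3f}  unconditional={unconditional:.3f}  exact={exact:.3f}")
```

Up to sampling noise, the conditional and unconditional frequencies should agree with each other and with the closed-form value.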


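Lemma A3


Assume that $Y = \mu + \sigma u$, where


$P(u \le y) = \frac{1}{1 + \exp(-y)}.$

Then

(A.13)  $P(u > y \mid Y > 0) = \frac{1 + \exp(-\mu/\sigma)}{1 + \exp(y)}$

for $y > -\mu/\sigma$, and equal to one for $y \le -\mu/\sigma$. Furthermore,


(A.14)  $E(u \mid Y > 0) = \left(1 + \exp\left(-\frac{\mu}{\sigma}\right)\right)\log\left(1 + \exp\left(\frac{\mu}{\sigma}\right)\right) - \frac{\mu}{\sigma} = -\frac{\log P(Y < 0)}{P(Y > 0)} - \frac{\mu}{\sigma}.$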


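Proof:


For $y > -\mu/\sigma$ we have


(A.15)  $P(u > y \mid Y > 0) = \frac{P(u > y)}{P(u > -\mu/\sigma)} = \frac{P(u < -y)}{P(u < \mu/\sigma)} = \frac{1 + \exp(-\mu/\sigma)}{1 + \exp(y)}$,

which proves (A.13).


Consider next (A.14). Let $\tilde{Y} = Y/\sigma$. Then for $y \ge 0$


(A.16)  $P(\tilde{Y} > y \mid \tilde{Y} > 0) = \frac{P(\tilde{Y} > y, \tilde{Y} > 0)}{P(\tilde{Y} > 0)} = \frac{P(\tilde{Y} > y)}{P(\tilde{Y} > 0)} = \frac{1 + \exp(-\mu/\sigma)}{1 + \exp(y - \mu/\sigma)}.$


Hence


(A.17)  $E(\tilde{Y} \mid \tilde{Y} > 0) = \int_{0}^{\infty} P(\tilde{Y} > y \mid \tilde{Y} > 0)\,dy = \left(1 + \exp\left(-\frac{\mu}{\sigma}\right)\right)\int_{0}^{\infty} \frac{dy}{1 + \exp(y - \mu/\sigma)} = \left(1 + \exp\left(-\frac{\mu}{\sigma}\right)\right)\log\left(1 + \exp\left(\frac{\mu}{\sigma}\right)\right).$


This implies that


$E(u \mid Y > 0) = E(\tilde{Y} \mid \tilde{Y} > 0) - \frac{\mu}{\sigma} = \left(1 + \exp\left(-\frac{\mu}{\sigma}\right)\right)\log\left(1 + \exp\left(\frac{\mu}{\sigma}\right)\right) - \frac{\mu}{\sigma}$


and (A.14) has thus been proved.
Q.E.D.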
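
The conditional mean in Lemma A3 can be checked against direct numerical integration of the logistic density. The following is an illustrative sketch under stated assumptions: the parameter values, the truncation point of the integral and the function names are hypothetical, and σ is taken to be positive.

```python
import math


def logistic_density(u):
    """Density of the standard logistic distribution, written symmetrically."""
    e = math.exp(-abs(u))
    return e / (1.0 + e) ** 2


def closed_form(mu, sigma):
    """E(u | Y > 0) as (1 + e^{-m}) log(1 + e^{m}) - m with m = mu / sigma."""
    m = mu / sigma
    return (1.0 + math.exp(-m)) * math.log(1.0 + math.exp(m)) - m


def probability_form(mu, sigma):
    """The same conditional mean written as -log P(Y < 0) / P(Y > 0) - mu / sigma."""
    m = mu / sigma
    p_positive = 1.0 / (1.0 + math.exp(-m))
    p_negative = 1.0 / (1.0 + math.exp(m))
    return -math.log(p_negative) / p_positive - m


def numerical(mu, sigma, upper=60.0, steps=200000):
    """Midpoint-rule value of E(u | u > -mu/sigma), truncated at `upper`."""
    m = mu / sigma
    lower = -m
    width = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        u = lower + (k + 0.5) * width
        total += u * logistic_density(u)
    return total * width * (1.0 + math.exp(-m))  # divide by P(u > -m)


for mu, sigma in ((1.0, 2.0), (-0.7, 1.5), (0.0, 1.0)):
    print(f"mu={mu:+.1f} sigma={sigma:.1f}  closed={closed_form(mu, sigma):.6f}  "
          f"numerical={numerical(mu, sigma):.6f}")
```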
Appendix B


The tax function applied in Dagsvik et al. (1986)
Let


ψ(x) =
    0.053x,                               x ∈ [0, 3000],
    3.38-107 (x−3000),                    x ∈ [3000, 49826],
    3.38-107 (0.81x +6467)" +0.053x,      x ∈ [49826, 237000],
    −27472 + 0.651x,                      x ∈ [237000, ∞).


Then the tax function is given by
T(hw, I) = ψ(hw + I),


when hw or I is less than NOK 22 000, and


T(hw, I) = ψ(hw) + ψ(I)


otherwise.